Abstract



These lecture notes were written with the aim of providing an accessible though technically solid introduction
to the logic of systematic analysis of statistical data to undergraduate and postgraduate students in the
Social Sciences and Economics in particular. They may also serve as a general reference for the application 
of quantitative-empirical research methods. In an attempt to encourage the adoption of an interdisciplinary 
perspective on quantitative problems arising in practice, the notes cover the four broad topics (i) descriptive 
statistical processing of raw data, (ii) elementary probability theory, mainly as seen from a frequentist's 
viewpoint, (iii) the operationalisation of one-dimensional latent variables according to Likert's widely used 
scaling approach, and (iv) the standard statistical test of hypotheses concerning (a) distributional differences 
of variables between subgroups of a population, and (b) statistical associations between two variables. The 
lecture notes are fully hyperlinked, thus providing a direct route to original scientific papers as well as to 
interesting biographical information. They also list many commands for activating statistical functions and 
data analysis routines in the software packages SPSS, R and EXCEL/OPEN OFFICE. 



These lecture notes were typeset in LaTeX2e.
Learning Outcomes for 0.1.3 SCIE



Students who have successfully participated in this module will be able to: 

• appropriately apply methods and work techniques of empirical research and adequately im- 
plement qualitative and quantitative methods of analysis (e.g. frequency distributions, mea- 
sures of central tendency, variance and association, correlation between two variables, linear 
regression). 

• understand and describe different approaches to the philosophy of science and epistemol- 
ogy; explain the relationship between the philosophy of science and standards of academic 
research in the management, economic and social sciences. 

• prepare texts, graphs, spreadsheets and presentations using standard software; thereby, be 
able to communicate in an academically suitable manner as well as convincingly present 
results.
Learning Outcomes for 0.3.2 RESO



Students who have successfully participated in this module will be able to: 

• present the execution of strategic planning within the context of the management process via 
the selection, procurement, allocation, deployment and organisation of financial and human 
resources. 

• explain the term resources in the context of a "resource-based view". 

• assess, allocate suitably depending on the situation and develop various resources from a 
general management perspective in the context of varying conditions ("constraints"), strate- 
gies and conflict situations ("tensions"). 

• apply different methods of researching and making decisions regarding the procurement 
measures required in a company. 

• describe the tasks and instruments of financial management (financial consequences of 
productivity-based decisions, alternative forms of financing, short and long-term financial 
and liquidity planning, capital expenditure budgeting including its mathematical principles). 

• understand the role of human resource management within the context of general manage- 
ment, explain and critically question the most important structures and processes of HRM 
and apply selected methods and tools of personnel management. 

• present the basic functional, institutional and behaviour-related aspects of the organisation, 
give a basic outline of research in the field of organisational theory and discuss various 
theoretical approaches. 

• analyse the composition of the organisation and its formal structure, interpret the objectives 
and conditions of structuring an organisation and assess organisation structures with a view 
to the situation and cultural context.
Learning Outcomes for 1.3.2 MARE



Students who have successfully participated in this module will be able to: 

• gather, record and analyze data in order to identify marketing opportunities and challenges. 

• distinguish between market research of the environmental conditions affecting offer and de- 
mand and marketing research (of the processes involved in attracting customers and main- 
taining their loyalty). 

• define what is relevant, reliable and valid information for understanding consumer needs. 

• appreciate the difference between investigating local consumer needs and those across re- 
gional or global markets. 

• apply research methods suitable for understanding consumer preferences, attitudes and be- 
haviours in relation to national as well as international contexts; in particular, be able to 
take into account cultural differences when gathering and interpreting consumer needs in 
different countries. 

• assess how changes in the elements of the Marketing Mix affect customer behaviour.
Learning Outcomes for 1.4.2 INOP



Students who have successfully participated in this module will be able to: 

• understand how international firms organize their foreign operations. 

• comprehend the complexities involved in global sourcing and explain when it is appropriate. 

• apply standard concepts, methods and techniques for making decisions on international op- 
erations and worldwide logistics. 

• apply probability theory and inferential statistics, in order to resolve questions of production 
planning and control. 

• perform sample tests of statistical hypotheses.

• evaluate best practice cases in outsourcing and offshoring. 

• analyse current trends in the relocation of productive MNC activities. 

• understand the importance of the operations management in order to remain competitive in 
international markets.
Introductory remarks



Statistical methods of data analysis form the cornerstone of quantitative-empirical research in 
the Social Sciences, Humanities, and Economics. Historically, the bulk of knowledge available 
in Statistics emerged in the context of the analysis of (large) data sets from observational and 
experimental measurements in the Natural Sciences. The purpose of the present lecture notes 
is to provide their readers with a solid and thorough, though accessible, introduction to the basic
concepts of Descriptive and Inferential Statistics. When discussing methods relating to the latter 
subject, we will take the perspective of the classical frequentist approach to probability theory. 

The concepts to be introduced and the topics to be covered have been selected in order to make 
available a fairly self-contained basic statistical tool kit for thorough analysis at the univariate and 
bivariate levels of complexity of data gained by means of opinion polls or surveys. In this respect, 
the present lecture notes are intended to specifically assist the teaching of statistical methods of data 
analysis in the bachelor degree programmes offered at Karlshochschule International University. 
In particular, the contents have immediate relevance to solving problems of a quantitative nature 
in either of (i) the year 1 and year 2 general management modules 

• 0.1.3 SCIE: Introduction to Scientific Research Methods 

• 0.3.2 RESO: Resources: Financial Resources, Human Resources, Organisation, 

and (ii) the year 2 special modules of the IB study programme 

• 1.3.2 MARE: Marketing Research 

• 1.4.2 INOP: International Operations. 

In the Social Sciences, Humanities, and Economics there are two broad families of empirical 
research tools available for studying behavioural features of and mutual interactions between human
individuals on the one hand, and the social systems and organisations they form on the other.
Qualitative-empirical methods focus on the individual with the aim of accounting for
her/his/its particular characteristic features, thus probing the "small-scale structure" of a social
system, while quantitative-empirical methods strive to recognise patterns and regularities that
pertain to a large number of individuals and so hope to gain insight into the "large-scale structure"
of a social system.

Both approaches are strongly committed to pursuing the principles of the scientific method. These 
entail the systematic observation and measurement of phenomena of interest on the basis of well-defined
variables, the structured analysis of data so generated, the attempt to provide compelling
theoretical explanations for effects for which there exists conclusive evidence in the data, the
derivation from the data of predictions which can be tested empirically, and the publication of all 
relevant data and the analytical and interpretational tools developed so that the pivotal reproducibil- 
ity of a researcher's findings and conclusions is ensured. By complying with these principles, the 
body of scientific knowledge available in any field of research and in practical applications of 
scientific insights undergoes a continuing process of updating and expansion. 

Having thoroughly worked through these lecture notes, a reader should have obtained a good 
understanding of the use and efficiency of standard statistical methods for handling quantitative 
issues as they often arise in a manager's everyday business life. Likewise, a reader should feel well- 
prepared for a smooth entry into a Master degree programme which puts emphasis on quantitative- 
empirical methods. 

Following a standard pedagogical concept, these lecture notes are split into three main parts: Part I,
which comprises Chapters 1 to 5, covers the basic considerations and tools of Descriptive Statistics;
Part II, which consists of Chapters 6 to 8, introduces the foundations of Probability Theory.
Finally, the material of Part III, provided in Chapters 9 to 12, first reviews a widespread method for
operationalising latent variables, and then introduces a number of standard uni- and bivariate analytical
methods of Inferential Statistics that prove particularly valuable in practical applications.
As such, the contents of Part III are the most important ones for quantitative-empirical research
work. Useful mathematical tools have been gathered in an appendix.

Recommended introductory textbooks, which may be used for study in parallel to these lecture
notes, are Levin et al. (2010), Hatzinger and Nagel (2009), Wewel (2008), Toutenburg
(2005), or Duller (2007). These textbooks, as well as many of the monographs listed in
the bibliography, are available in the library of Karlshochschule International University.

Extended worked examples and exercises on the topics discussed are not included. These are
reserved for the lectures given throughout term time in any of the modules mentioned.

The present lecture notes are designed to be dynamic in character. On the one hand,
this means that they will be updated on a regular basis. On the other, they contain
interactive features such as fully hyperlinked references, as well as, in the *.pdf version,
many active links to biographical references of scientists who have been influential in
the historical development of Probability Theory and Statistics, hosted by the websites
The MacTutor History of Mathematics archive (www-history.mcs.st-and.ac.uk) and
en.wikipedia.org.

Lastly, throughout the text references have been provided to respective descriptive and inferential
statistical functions and routines that are available on a standard graphic display calculator
(GDC), the statistical software packages EXCEL/OPEN OFFICE and SPSS, and, for more technically
inclined readers, the widespread statistical software package R. The latter can be obtained
as free software from cran.r-project.org and has been employed for generating the figures
included in the text. A useful and easily accessible textbook on the application of R for statistical
data analysis is, e.g., Dalgaard (2008). Further helpful information and assistance is available
from the website www.r-tutor.com.

Acknowledgments: I am grateful to Kai Holschuh and Eva Kunz for valuable comments on an 
earlier draft of these lecture notes. 



Chapter 1 
Statistical variables 



A central aim of empirical scientific disciplines is the observation of characteristic variable fea- 
tures of a given system of objects chosen for study, and the attempt to recognise patterns and reg- 
ularities which indicate associations, or, stronger still, causal relationships between them. Based 
on a combination of inductive and deductive methods of analysis, the hope is to gain insight of 
a qualitative and/or quantitative nature into the intricate and often complex interdependencies of 
such features for the purpose of deriving predictions which can be tested. It is the interplay of 
experimentation and theoretical modelling, coupled to one another by a number of feedback loops, 
which generically gives rise to progress in learning and understanding in all scientific activities. 

More specifically, the intention is to modify or strengthen the theoretical foundations of a scientific
discipline by means of observational and/or experimental falsification of sets of hypotheses.
This is generally achieved by employing the quantitative-empirical techniques that have been developed
in Statistics, in particular in the course of the 20th century. At the heart of these techniques
is the concept of a statistical variable $X$ as an entity which represents a single common aspect of
the system of objects selected for analysis, the population $\Omega$ of a statistical investigation. In the
ideal case, a variable entertains a one-to-one correspondence with an observable, and thus is directly
amenable to measurement. In the Social Sciences, Humanities, and Economics, however,
one needs to carefully distinguish between manifest variables corresponding to observables on
the one hand, and latent variables representing in general unobservable "social constructs"
on the other. It is this latter kind of variable which is commonplace in the fields mentioned.
Hence, it becomes necessary to thoroughly address the issue of a reliable, valid and objective
operationalisation of any given latent variable one has identified as providing essential information
on the objects under investigation. A standard approach to dealing with this important matter is
reviewed in Ch. 9.

In Statistics, it has proven useful to classify variables on the basis of their intrinsic information
content into one of three hierarchically ordered categories, referred to as scale levels. We provide
the definition of these scale levels next.



1.1 Scale levels 

Def.: Let $X$ be a 1-D statistical variable with $k \in \mathbb{N}$ resp. $\mathbb{R}$ possible values, attributes,
or categorical levels $a_j$ ($j = 1, \ldots, k$). Statistical variables are classified into one of three
hierarchically ordered scale levels of measurement according to up to three criteria for distinguishing
between the possible values or attributes they may take, and the kind of information they contain.
One thus defines:

• Metrically scaled variables $X$ (quantitative/numerical)

Possible values can be distinguished by

(i) their names, $a_i \neq a_j$,

(ii) they allow for a natural ordering, $a_i < a_j$, and

(iii) distances between them, $a_i - a_j$, are uniquely determined.

- Ratio scale: $X$ has an absolute zero point and otherwise only non-negative values;
analysis of both differences $a_i - a_j$ and ratios $a_i / a_j$ is meaningful.

Examples: body height, monthly net income.

- Interval scale: $X$ has no absolute zero point; only differences $a_i - a_j$ are meaningful.

Examples: year of birth, temperature in degrees Celsius.

• Ordinally scaled variables $X$ (qualitative/categorical)

Possible values or attributes can be distinguished by

(i) their names, $a_i \neq a_j$, and

(ii) they allow for a natural ordering, $a_i < a_j$.

Examples: 5-level Likert item rating scale [Rensis Likert (1903-1981), USA], grading of
commodities.

• Nominally scaled variables $X$ (qualitative/categorical)

Possible values or attributes can be distinguished only by

(i) their names, $a_i \neq a_j$.

Examples: first name, location of birth.



Remark: Note that the applicability of specific methods of statistical data analysis, some of
which will be discussed in Ch. 11 and 12 below, crucially depends on the scale level of the variables
involved in the procedures. Metrically scaled variables offer the largest variety of useful methods!
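
The hierarchy of scale levels can be summarised as a lookup of which comparisons between two values are meaningful at each level. The following Python sketch is only an illustration; the names `SCALE_OPERATIONS` and `is_meaningful` are our own assumptions and do not belong to any of the software packages referred to in these notes.

```python
# Illustrative sketch: comparisons that are meaningful at each scale level.
# The dictionary and function names are hypothetical, chosen for this example.
SCALE_OPERATIONS = {
    "nominal": {"equality"},
    "ordinal": {"equality", "ordering"},
    "interval": {"equality", "ordering", "difference"},
    "ratio": {"equality", "ordering", "difference", "ratio"},
}


def is_meaningful(scale_level: str, operation: str) -> bool:
    """Return True if the operation is meaningful at the given scale level."""
    return operation in SCALE_OPERATIONS[scale_level]


# Ranks of a Likert item can be ordered, but their ratios carry no meaning.
likert_can_be_ordered = is_meaningful("ordinal", "ordering")
likert_has_ratios = is_meaningful("ordinal", "ratio")
```

Each level inherits all operations of the levels below it, which is what makes the classification hierarchical.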
1.2 Raw data sets and data matrices

To set the stage for subsequent considerations, we here introduce some formal representations
of entities which assume central roles in statistical data analyses. Let $\Omega$ denote the population
of study objects of interest (e.g., human individuals forming a particular social system) relating to
some statistical investigation. This set $\Omega$ shall comprise a total of $N \in \mathbb{N}$ statistical units, i.e., its
size is $|\Omega| = N$. Suppose one intends to determine the distributional properties in $\Omega$ of $m \in \mathbb{N}$
statistical variables $X$, $Y$, ..., and $Z$, with spectra of values $a_1, a_2, \ldots, a_k$, $b_1, b_2, \ldots, b_l$, ..., and
$c_1, c_2, \ldots, c_p$, respectively ($k, l, p \in \mathbb{N}$). A survey typically obtains from $\Omega$ a statistical sample
$S_\Omega$ of size $|S_\Omega| = n$ ($n \in \mathbb{N}$, $n < N$), unless one is given the rare opportunity to conduct a proper
census on $\Omega$. The data thus generated consists of observed values $\{x_i\}_{i=1,\ldots,n}$, $\{y_i\}_{i=1,\ldots,n}$, ...,
and $\{z_i\}_{i=1,\ldots,n}$. It constitutes the multivariate raw data set $\{(x_i, y_i, \ldots, z_i)\}_{i=1,\ldots,n}$ of a statistical
investigation and may be conveniently assembled in the form of an ($n \times m$) data matrix $\boldsymbol{X}$
given by



sampling unit    variable X     variable Y        ...    variable Z
1                $x_1 = a_5$    $y_1 = b_9$       ...    $z_1 = c_3$
2                $x_2 = a_2$    $y_2 = b_{12}$    ...    $z_2 = c_8$
...              ...            ...               ...    ...
n                $x_n = a_8$    $y_n = b_9$       ...    $z_n = c_{15}$



For recording information obtained from a statistical sample $S_\Omega$, in this matrix scheme every one of
the $n$ sampling units investigated is allocated a particular row, while every one of the $m$ statistical
variables measured is allocated a particular column; in the following, $X_{ij}$ denotes the data entry in
the $i$th row ($i = 1, \ldots, n$) and the $j$th column ($j = 1, \ldots, m$) of $\boldsymbol{X}$. In general, an ($n \times m$) data
matrix $\boldsymbol{X}$ is the starting point for the application of a statistical software package such as SPSS
or R for the purpose of systematic data analysis. Note that in the case of a sample of exclusively
metrically scaled data, $\boldsymbol{X} \in \mathbb{R}^{n \times m}$; cf. the lecture notes Ref. [Q21 Sec. 2.1].
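
As a minimal illustration of this row-and-column scheme, the following Python sketch stores a small data matrix as a list of rows. The three variables and all numerical values are invented for the example, and the helper names `entry` and `column` are our own assumptions.

```python
# Hypothetical sample with n = 3 sampling units (rows) and
# m = 3 metrically scaled variables X, Y, Z (columns); values are invented.
variable_names = ["X", "Y", "Z"]
data_matrix = [
    [1.72, 2100.0, 34.0],  # sampling unit 1
    [1.65, 1850.0, 29.0],  # sampling unit 2
    [1.80, 2400.0, 41.0],  # sampling unit 3
]

n = len(data_matrix)     # number of sampling units
m = len(data_matrix[0])  # number of variables


def entry(i, j):
    """Data entry in the i-th row and j-th column (1-based, as in the text)."""
    return data_matrix[i - 1][j - 1]


def column(j):
    """All n observed values of the j-th variable (1-based)."""
    return [row[j - 1] for row in data_matrix]


second_unit_first_variable = entry(2, 1)
values_of_z = column(3)
```

Reading off a column yields the univariate raw data set of a single variable, which is the starting point of the next chapter.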

We next turn to describe phenomenologically the distributional properties of a single 1-D statistical
variable $X$ in a specific statistical sample $S_\Omega$ of size $n$, drawn in the context of a survey from
some population of study objects $\Omega$ of size $N$.






Chapter 2 

Frequency distributions 



The first task at hand in unravelling the intrinsic structure which resides in a given raw data set
$\{x_i\}_{i=1,\ldots,n}$ for some statistical variable $X$ corresponds to Cinderella's task of separating the good
peas from the bad peas, and collecting them in respective bowls (or bins). This is to say, the first
question to be answered requires determination of the frequencies with which a value $a_j$ in the
spectrum of possible values of $X$ occurred in the statistical sample $S_\Omega$.



2.1 Absolute and relative frequencies 

Def.: Let $X$ be a nominally, ordinally or metrically scaled 1-D statistical variable, with a spectrum
of $k$ different values or attributes $a_j$ resp. $k$ different categories (or bins) $K_j$ ($j = 1, \ldots, k$).
If, for $X$, we have a raw data set comprising $n$ observed values $\{x_i\}_{i=1,\ldots,n}$, we define by

\[
o_j := \begin{cases}
o_n(a_j) = \text{number of } x_i \text{ with } x_i = a_j \\
o_n(K_j) = \text{number of } x_i \text{ with } x_i \in K_j
\end{cases}
\tag{2.1}
\]

($j = 1, \ldots, k$) the absolute (observed) frequency of $a_j$ resp. $K_j$, and, upon division of the $o_j$ by
the sample size $n$, we define by

\[
h_j := \begin{cases}
\dfrac{o_n(a_j)}{n} \\[1ex]
\dfrac{o_n(K_j)}{n}
\end{cases}
\tag{2.2}
\]

($j = 1, \ldots, k$) the relative frequency of $a_j$ resp. $K_j$. Note that for all $j = 1, \ldots, k$, we have
$0 \leq o_j \leq n$ with $\sum_{j=1}^{k} o_j = n$, and $0 \leq h_j \leq 1$ with $\sum_{j=1}^{k} h_j = 1$. The $k$ value pairs
$(a_j, o_j)_{j=1,\ldots,k}$ resp. $(K_j, o_j)_{j=1,\ldots,k}$ represent the distribution of absolute frequencies, the $k$ value
pairs $(a_j, h_j)_{j=1,\ldots,k}$ resp. $(K_j, h_j)_{j=1,\ldots,k}$ represent the distribution of relative frequencies of the
$a_j$ resp. $K_j$ in $S_\Omega$.

EXCEL: FREQUENCY (dt.: HÄUFIGKEIT)

SPSS: Analyze → Descriptive Statistics → Frequencies ...
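
For readers who wish to see Eqs. (2.1) and (2.2) at work, the following Python sketch counts absolute frequencies and divides them by the sample size. The raw data are invented, and the function name `frequency_distribution` is our own assumption.

```python
from collections import Counter


def frequency_distribution(raw_data):
    """Return (absolute, relative) frequencies of the distinct values a_j."""
    n = len(raw_data)
    absolute = Counter(raw_data)                        # o_j, Eq. (2.1)
    relative = {a: o / n for a, o in absolute.items()}  # h_j = o_j / n, Eq. (2.2)
    return dict(absolute), relative


# Invented raw data set of n = 8 observations of a nominally scaled variable.
raw_data = ["A", "B", "A", "C", "A", "B", "C", "A"]
absolute, relative = frequency_distribution(raw_data)

# The absolute frequencies add up to n, the relative frequencies to 1.
total_absolute = sum(absolute.values())
total_relative = sum(relative.values())
```

The two totals provide a quick consistency check on any frequency table compiled by hand.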



Typical graphical representations of relative frequency distributions, regularly employed in mak- 
ing results of descriptive statistical data analyses public, are the 

• histogram for metrically scaled data, 

• bar chart for ordinally scaled data, 

• pie chart for nominally scaled data. 

It is standard practice in Statistics to compile from the relative frequency distribution
$(a_j, h_j)_{j=1,\ldots,k}$ resp. $(K_j, h_j)_{j=1,\ldots,k}$ of data for some ordinally or metrically scaled 1-D variable $X$
the associated empirical cumulative distribution function. Here it is necessary to distinguish the
case of data for a variable with a discrete spectrum of values from the case of data for a variable
with a continuous spectrum of values. We will discuss this issue next.

2.2 Empirical cumulative distribution function (discrete data) 

Def.: Let $X$ be an ordinally or metrically scaled 1-D statistical variable, the spectrum of values
$a_j$ ($j = 1, \ldots, k$) of which vary discretely. Suppose given for $X$ a statistical sample $S_\Omega$ of
size $|S_\Omega| = n$ comprising observed values $\{x_i\}_{i=1,\ldots,n}$, which we assume ordered in increasing
fashion according to $a_1 < a_2 < \ldots < a_k$. The corresponding relative frequency distribution is
$(a_j, h_j)_{j=1,\ldots,k}$. For all real numbers $x \in \mathbb{R}$, we then define by

\[
F_n(x) := \begin{cases}
0 & \text{for } x < a_1 \\[1ex]
\sum_{i=1}^{j} h_n(a_i) & \text{for } a_j \leq x < a_{j+1} \quad (j = 1, \ldots, k-1) \\[1ex]
1 & \text{for } x \geq a_k
\end{cases}
\tag{2.3}
\]

the empirical cumulative distribution function for $X$. The value of $F_n$ at $x \in \mathbb{R}$ represents the
cumulative relative frequencies of all $a_j$ which are less or equal to $x$. $F_n(x)$ has the following
properties:

• its domain is $D(F_n) = \mathbb{R}$, and its range is $W(F_n) = [0, 1]$; hence, $F_n$ is bounded from above
and from below,

• it is continuous from the right and monotonically increasing,

• it is constant on all half-open intervals $[a_j, a_{j+1})$, but exhibits jump discontinuities at all
$a_{j+1}$ of size $h_n(a_{j+1})$, and,

• asymptotically, it behaves as $\lim_{x \to -\infty} F_n(x) = 0$ and $\lim_{x \to +\infty} F_n(x) = 1$.

Computational rules for $F_n(x)$

1. $h(x \leq d) = F_n(d)$

2. $h(x < d) = F_n(d) - h_n(d)$

3. $h(x \geq c) = 1 - F_n(c) + h_n(c)$

4. $h(x > c) = 1 - F_n(c)$

5. $h(c \leq x \leq d) = F_n(d) - F_n(c) + h_n(c)$

6. $h(c < x \leq d) = F_n(d) - F_n(c)$

7. $h(c \leq x < d) = F_n(d) - F_n(c) - h_n(d) + h_n(c)$

8. $h(c < x < d) = F_n(d) - F_n(c) - h_n(d)$,

wherein $c$ denotes an arbitrary lower bound, and $d$ denotes an arbitrary upper bound, on the
argument $x$ of $F_n(x)$.
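
The step function and two of the computational rules can be checked numerically. The following Python sketch is a minimal illustration with invented data; the function name `make_ecdf` is our own assumption.

```python
from collections import Counter


def make_ecdf(raw_data):
    """Return (F, h): the empirical cdf F_n and the relative frequency h_n."""
    n = len(raw_data)
    counts = Counter(raw_data)

    def h(value):
        # Relative frequency of a single value (zero if it never occurred).
        return counts.get(value, 0) / n

    def F(x):
        # Cumulative relative frequency of all observed values a_j <= x.
        return sum(o for a, o in counts.items() if a <= x) / n

    return F, h


# Invented raw data set of n = 8 observations of a discrete variable.
raw_data = [1, 2, 2, 3, 3, 3, 5, 5]
F, h = make_ecdf(raw_data)

c, d = 2, 5
share_closed = F(d) - F(c) + h(c)  # rule 5: share of values with c <= x <= d
share_open = F(d) - F(c) - h(d)    # rule 8: share of values with c < x < d
```

Here seven of the eight observations lie in the closed interval from 2 to 5, and three lie strictly between 2 and 5, in agreement with the two rules.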

2.3 Empirical cumulative distribution function (continuous data)

Def.: Let $X$ be a metrically scaled 1-D statistical variable, the spectrum of values of which vary
continuously, and let observed values $\{x_i\}_{i=1,\ldots,n}$ for $X$ from a statistical sample $S_\Omega$ of size
$|S_\Omega| = n$ be binned into $k$ class intervals (or bins) $K_j$ ($j = 1, \ldots, k$), of width $b_j$, with lower
boundary $u_j$ and upper boundary $o_j$. The distribution of relative frequencies of the class intervals
be $(K_j, h_j)_{j=1,\ldots,k}$. Then, for all real numbers $x \in \mathbb{R}$,

\[
F_n(x) := \begin{cases}
0 & \text{for } x < u_1 \\[1ex]
\sum_{i=1}^{j-1} h_i + \dfrac{h_j}{b_j}\,(x - u_j) & \text{for } x \in K_j \\[1ex]
1 & \text{for } x > o_k
\end{cases}
\tag{2.4}
\]

defines the empirical cumulative distribution function for $X$. $F_n(x)$ has the following properties:

• its domain is $D(F_n) = \mathbb{R}$, and its range is $W(F_n) = [0, 1]$; hence, $F_n$ is bounded from above
and from below,

• it is continuous and monotonically increasing, and,

• asymptotically, it behaves as $\lim_{x \to -\infty} F_n(x) = 0$ and $\lim_{x \to +\infty} F_n(x) = 1$.

Computational rules for $F_n(x)$

1. $h(x < d) = h(x \leq d) = F_n(d)$

2. $h(x > c) = h(x \geq c) = 1 - F_n(c)$

3. $h(c < x < d) = h(c \leq x < d) = h(c < x \leq d) = h(c \leq x \leq d) = F_n(d) - F_n(c)$,

wherein $c$ denotes an arbitrary lower bound, and $d$ denotes an arbitrary upper bound, on the
argument $x$ of $F_n(x)$.
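
For binned data the empirical cumulative distribution function rises linearly within each class interval. The following Python sketch illustrates this under the assumption that the class intervals are contiguous and listed in increasing order; the bins, relative frequencies and the function name `binned_ecdf` are invented for the example.

```python
def binned_ecdf(bins, rel_freqs):
    """Empirical cdf for binned data.

    bins      -- list of (u_j, o_j) pairs: lower and upper class boundaries,
                 assumed contiguous and in increasing order
    rel_freqs -- list of relative frequencies h_j, one per class interval
    """
    def F(x):
        if x < bins[0][0]:
            return 0.0
        cumulative = 0.0  # sum of h_i over all class intervals below x
        for (lower, upper), h in zip(bins, rel_freqs):
            if x <= upper:
                width = upper - lower  # b_j
                return cumulative + h / width * (x - lower)
            cumulative += h
        return 1.0  # x lies above the last upper boundary o_k

    return F


# Invented binned data: three class intervals of widths 10, 10 and 20.
bins = [(0.0, 10.0), (10.0, 20.0), (20.0, 40.0)]
rel_freqs = [0.25, 0.5, 0.25]
F = binned_ecdf(bins, rel_freqs)

share_between = F(30.0) - F(5.0)  # rule 3 with c = 5 and d = 30
```

Because the function is continuous, it does not matter for the result whether the bounds $c$ and $d$ are included in the interval or not.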

Our next step is to introduce a set of scale level-dependent standard descriptive measures which 
characterise specific properties of univariate and bivariate relative frequency distributions of statis- 
tical variables X resp. (X, Y). 



Chapter 3
Descriptive measures for univariate distributions

There are four families of scale level-dependent standard measures one employs in Statistics to
describe characteristic properties of univariate relative frequency distributions. We will introduce
these in turn. In the following we suppose given from a survey for some 1-D statistical variable $X$
either (i) a raw data set $\{x_i\}_{i=1,\ldots,n}$ of $n$ measured values, or (ii) a relative frequency distribution
$(a_j, h_j)_{j=1,\ldots,k}$ resp. $(K_j, h_j)_{j=1,\ldots,k}$.

3.1 Measures of central tendency 

Let us begin with the measures of central tendency which intend to convey a notion of "middle" 
or "centre" of a univariate relative frequency distribution. 

3.1.1 Mode 

The mode $x_{\mathrm{mod}}$ (nom, ord, metr) of the relative frequency distribution of any 1-D variable $X$ is
that value $a_j$ in $X$'s spectrum which occurred with the highest measured relative frequency in a
statistical sample $S_\Omega$. Note that the mode does not necessarily take a unique value.

Def.: $h_n(x_{\mathrm{mod}}) \geq h_n(a_j)$ for all $j = 1, \ldots, k$.

EXCEL: MODE (dt.: MODUS.EINF, MODALWERT)

SPSS: Analyze → Descriptive Statistics → Frequencies ... → Statistics ...: Mode
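
Since the mode need not be unique, a direct implementation of the definition should return all values that attain the highest frequency. A minimal Python sketch, with invented data and the assumed function name `modes`, reads:

```python
from collections import Counter


def modes(raw_data):
    """Return every value whose observed frequency is the highest one."""
    counts = Counter(raw_data)
    highest = max(counts.values())
    return [a for a, o in counts.items() if o == highest]


# Invented nominally scaled data: "red" and "blue" both occur twice.
colours = ["red", "blue", "red", "green", "blue"]
colour_modes = modes(colours)

# Invented data with a unique mode.
unique_mode = modes([1, 1, 2])
```

In the first data set the mode takes two values, which illustrates the remark on non-uniqueness made above.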

3.1.2 Median 

To determine the median $\tilde{x}_{0.5}$ (or $Q_2$) (ord, metr) of the relative frequency distribution of an
ordinally or metrically scaled 1-D variable $X$, it is necessary to first bring the $n$ observed values
$\{x_i\}_{i=1,\ldots,n}$ into their natural hierarchical order, i.e., $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$.

Def.: For the sequentially ordered $n$ observed values $\{x_i\}_{i=1,\ldots,n}$, at least 50% have a rank lower
or equal to resp. are less or equal to the median value $\tilde{x}_{0.5}$, and at least 50% have a rank higher or
equal to resp. are greater or equal to the median value $\tilde{x}_{0.5}$.

(i) Discrete data

\[
F_n(\tilde{x}_{0.5}) \geq 0.5 , \qquad
\tilde{x}_{0.5} = \begin{cases}
x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[1ex]
\frac{1}{2}\left[ x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)} \right] & \text{if } n \text{ is even}
\end{cases}
\tag{3.1}
\]

(ii) Binned data

\[
F_n(\tilde{x}_{0.5}) = 0.5
\]

The class interval $K_i$ contains the median value $\tilde{x}_{0.5}$, if $\sum_{j=1}^{i-1} h_j < 0.5$ and $\sum_{j=1}^{i} h_j \geq 0.5$. Then

\[
\tilde{x}_{0.5} = u_i + \frac{b_i}{h_i} \left( 0.5 - \sum_{j=1}^{i-1} h_j \right) .
\tag{3.2}
\]

Alternatively, the median of a statistical sample $S_\Omega$ for a continuous variable $X$ with binned
data $(K_j, h_j)_{j=1,\ldots,k}$ can be obtained from the associated empirical cumulative distribution
function by solving the condition $F_n(\tilde{x}_{0.5}) = 0.5$ for $\tilde{x}_{0.5}$; cf. Eq. (2.4) and footnote 1.

Remark: Note that the value of the median of a relative frequency distribution is fairly insensitive
to so-called outliers in a statistical sample.

EXCEL: MEDIAN (dt.: MEDIAN)

SPSS: Analyze → Descriptive Statistics → Frequencies ... → Statistics ...: Median

R: median(variable)
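
Both cases of the median can be sketched in a few lines of Python. The functions below follow Eqs. (3.1) and (3.2) directly; their names and the example data are our own assumptions, and the class intervals are taken to be contiguous and in increasing order.

```python
def median_raw(raw_data):
    """Median of a raw data set according to Eq. (3.1)."""
    ordered = sorted(raw_data)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the single middle value x_((n+1)/2) (list index is 0-based)
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the two middle values x_(n/2) and x_(n/2+1)
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])


def median_binned(bins, rel_freqs):
    """Median of binned data according to Eq. (3.2).

    bins      -- list of (u_i, o_i) class boundaries, contiguous, increasing
    rel_freqs -- list of relative frequencies h_i
    """
    cumulative = 0.0  # sum of h_j over the class intervals already passed
    for (lower, upper), h in zip(bins, rel_freqs):
        if cumulative + h >= 0.5:
            width = upper - lower  # b_i
            return lower + width / h * (0.5 - cumulative)
        cumulative += h
    raise ValueError("relative frequencies must add up to 1")


odd_case = median_raw([3, 1, 2])
even_case = median_raw([4, 1, 3, 2])
binned_case = median_binned(
    [(0.0, 10.0), (10.0, 20.0), (20.0, 40.0)], [0.25, 0.5, 0.25]
)
```

In the binned example the second class interval contains the median, because the cumulative relative frequency passes 0.5 there.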



3.1.3 α-Quantile

A generalisation of the median is the concept of the α-quantile $\tilde{x}_\alpha$ (ord, metr) of the relative
frequency distribution of an ordinally or metrically scaled 1-D variable $X$. Again, it is necessary
to first bring the $n$ observed values $\{x_i\}_{i=1,\ldots,n}$ into their natural hierarchical order, i.e.,
$x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$.

Def.: For the sequentially ordered $n$ observed values $\{x_i\}_{i=1,\ldots,n}$, for given $\alpha$ with $0 < \alpha < 1$ at
least $\alpha \times 100\%$ have a rank lower or equal to resp. are less or equal to the α-quantile $\tilde{x}_\alpha$, and at
least $(1 - \alpha) \times 100\%$ have a rank higher or equal to resp. are greater or equal to the α-quantile $\tilde{x}_\alpha$.

(i) Discrete data

\[
F_n(\tilde{x}_\alpha) \geq \alpha , \qquad
\tilde{x}_\alpha = \begin{cases}
x_{(k)} & \text{if } n\alpha \notin \mathbb{N},\ k > n\alpha \\[1ex]
\frac{1}{2}\left[ x_{(k)} + x_{(k+1)} \right] & \text{if } k = n\alpha \in \mathbb{N}
\end{cases}
\tag{3.3}
\]

[Footnote 1] From a mathematical point of view this amounts to the following problem: consider a straight line which contains
the point with coordinates $(x_0, y_0)$ and has non-zero slope $y'(x_0) \neq 0$, i.e., $y = y_0 + y'(x_0)(x - x_0)$. Re-arranging to
solve for the variable $x$ then yields $x = x_0 + [y'(x_0)]^{-1}(y - y_0)$.

(ii) Binned data

The class interval $K_i$ contains the α-quantile $\tilde{x}_\alpha$, if $\sum_{j=1}^{i-1} h_j < \alpha$ and $\sum_{j=1}^{i} h_j \geq \alpha$. Then

\[
\tilde{x}_\alpha = u_i + \frac{b_i}{h_i} \left( \alpha - \sum_{j=1}^{i-1} h_j \right) .
\tag{3.4}
\]

Alternatively, an α-quantile of a statistical sample $S_\Omega$ for a continuous variable $X$ with
binned data $(K_j, h_j)_{j=1,\ldots,k}$ can be obtained from the associated empirical cumulative distribution
function by solving the condition $F_n(\tilde{x}_\alpha) = \alpha$ for $\tilde{x}_\alpha$; cf. Eq. (2.4).

Remark: The quantiles $\tilde{x}_{0.25}$, $\tilde{x}_{0.5}$, $\tilde{x}_{0.75}$ (also denoted by $Q_1$, $Q_2$, $Q_3$) have special status. They
are referred to as the first quartile → second quartile (median) → third quartile of a relative
frequency distribution for an ordinally or a metrically scaled 1-D variable $X$ and form the core of the
five number summary of the respective distribution.

EXCEL: PERCENTILE (dt.: QUANTIL.EXKL, QUANTIL)

SPSS: Analyze → Descriptive Statistics → Frequencies ... → Statistics ...: Percentile(s)

R: quantile(variable, α)
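
The two cases of the α-quantile may be sketched in Python as follows. For raw data we read the condition $k > n\alpha$ as "the smallest integer $k$ exceeding $n\alpha$"; this reading, the function names and the example data are our own assumptions. Software packages may use other interpolation conventions, so their results can differ slightly from this sketch.

```python
import math


def quantile_raw(raw_data, alpha):
    """alpha-quantile of a raw data set, following the case distinction above."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(raw_data)
    position = len(ordered) * alpha  # n * alpha (subject to rounding error)
    if position == int(position):
        # n * alpha is a natural number k: average x_(k) and x_(k+1)
        k = int(position)
        return 0.5 * (ordered[k - 1] + ordered[k])
    # otherwise take x_(k) with k the smallest integer larger than n * alpha
    k = math.ceil(position)
    return ordered[k - 1]


def quantile_binned(bins, rel_freqs, alpha):
    """alpha-quantile of binned data; bins are contiguous and increasing."""
    cumulative = 0.0  # sum of h_j over the class intervals already passed
    for (lower, upper), h in zip(bins, rel_freqs):
        if cumulative + h >= alpha:
            return lower + (upper - lower) / h * (alpha - cumulative)
        cumulative += h
    raise ValueError("relative frequencies must add up to 1")


data = [1, 2, 3, 4, 5, 6, 7, 8]
first_quartile = quantile_raw(data, 0.25)
odd_sample_quartile = quantile_raw([1, 2, 3, 4, 5], 0.25)

bins = [(0.0, 10.0), (10.0, 20.0), (20.0, 40.0)]
third_quartile_binned = quantile_binned(bins, [0.25, 0.5, 0.25], 0.75)
```

Setting `alpha = 0.5` reproduces the median in both cases.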



3.1.4 Five number summary 

The five number summary (ord, metr) of the relative frequency distribution of an ordinally or
metrically scaled 1-D variable $X$ is a compact compilation of information giving the (i) lowest
rank resp. smallest value, (ii) first quartile, (iii) second quartile or median, (iv) third quartile, and
(v) highest rank resp. largest value that $X$ takes in a raw data set $\{x_i\}_{i=1,\ldots,n}$ from a statistical
sample $S_\Omega$, i.e.,

\[
\{ x_{(1)}, \tilde{x}_{0.25}, \tilde{x}_{0.5}, \tilde{x}_{0.75}, x_{(n)} \} .
\tag{3.5}
\]

Alternative notation: $\{Q_0, Q_1, Q_2, Q_3, Q_4\}$.

EXCEL: MIN, QUARTILE, MAX (dt.: MIN, QUARTILE.EXKL, QUARTILE, MAX)

SPSS: Analyze → Descriptive Statistics → Frequencies ... → Statistics ...: Quartiles, Minimum, Maximum

R: quantile(variable)

A very convenient graphical method for transparently displaying distributional features of metrically
scaled data relating to a five number summary is provided by a box plot; see, e.g., Tukey
(1977).
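
The five number summary of a raw data set can be assembled from the ordered values and the three quartiles. The Python sketch below uses the quantile rule for discrete data given in Sec. 3.1.3, reading $k > n\alpha$ as the smallest such integer; this reading and the function names are our own assumptions, and the values returned by a software package may differ slightly when it applies another interpolation convention.

```python
import math


def quantile(ordered, alpha):
    """alpha-quantile of an already ordered raw data set (0 < alpha < 1)."""
    position = len(ordered) * alpha  # n * alpha
    if position == int(position):
        k = int(position)
        return 0.5 * (ordered[k - 1] + ordered[k])
    return ordered[math.ceil(position) - 1]


def five_number_summary(raw_data):
    """Return (Q0, Q1, Q2, Q3, Q4) for a raw data set."""
    ordered = sorted(raw_data)
    return (
        ordered[0],               # smallest value x_(1)
        quantile(ordered, 0.25),  # first quartile
        quantile(ordered, 0.5),   # second quartile (median)
        quantile(ordered, 0.75),  # third quartile
        ordered[-1],              # largest value x_(n)
    )


# Invented raw data set of n = 8 observations.
summary = five_number_summary([7, 1, 3, 5, 2, 8, 4, 6])
```

These five numbers are exactly the quantities that a box plot displays.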

All measures of central tendency which now follow are defined exclusively for characterising rel- 
ative frequency distributions of metrically scaled 1-D variables X only. 
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3.1.5 Arithmetical mean 

The best known measure of central tendency is the dimensionful arithmetical mean x (metr). 
Given adequate statistical data, it is defined by: 



(i) From raw data set: 



x :- 



1 1 n 

- (X! + . . . + X n ) -. - Y] 



(ii) From relative frequency distribution: 



(3.6) 



x := aih n (ai) + . . . + a k h n (a k ) =: ^a J 7i„(a J ) 

3=1 



(3.7) 



Remarks: (i) The value of the arithmetical mean is very sensitive to outliers.

(ii) For binned data one selects the midpoint of each class interval $K_j$ to represent the $a_j$ (provided
the raw data set is no longer accessible).

EXCEL: AVERAGE (dt.: MITTELWERT)

SPSS: Analyze → Descriptive Statistics → Frequencies... → Statistics...: Mean

R: mean(variable)
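As an illustration of Eqs. (3.6) and (3.7), the following minimal Python sketch computes the arithmetical mean once from a raw data set and once from its relative frequency distribution. The function names and the sample values are hypothetical and serve only as an example; they are not part of the software commands listed above.

```python
from collections import Counter


def mean_from_raw(xs):
    """Arithmetical mean from a raw data set, cf. Eq. (3.6)."""
    return sum(xs) / len(xs)


def mean_from_relative_frequencies(xs):
    """Arithmetical mean from the relative frequency distribution, cf. Eq. (3.7)."""
    n = len(xs)
    # h_n(a_j) = o_n(a_j) / n for every distinct observed value a_j
    rel_freq = {a: count / n for a, count in Counter(xs).items()}
    return sum(a * h for a, h in rel_freq.items())


# Hypothetical sample, chosen for illustration only
data = [2, 3, 3, 5, 7, 7, 7, 10]
print(mean_from_raw(data), mean_from_relative_frequencies(data))
```

Both routes should agree up to floating-point rounding, since Eq. (3.7) merely groups equal summands of Eq. (3.6).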



3.1.6 Weighted mean 

In practice, one also encounters the dimensionful weighted mean $\bar{x}_w$ (metr), defined by

\[ \bar{x}_w := w_1 x_1 + \ldots + w_n x_n =: \sum_{i=1}^{n} w_i x_i ; \tag{3.8} \]

the $n$ weight factors $w_1, \ldots, w_n$ need to satisfy the constraints

\[ 0 \leq w_1, \ldots, w_n \leq 1 \quad\text{and}\quad w_1 + \ldots + w_n = \sum_{i=1}^{n} w_i = 1 . \tag{3.9} \]



3.2 Measures of variability 

The idea behind the measures of variability is to convey a notion of the "spread" of data in a 
given statistical sample $S_\Omega$, technically also referred to as the dispersion of the data. As the
realisation of this intention requires a well-defined concept of distance, the measures of variability 
are meaningful for data relating to metrically scaled 1-D variables X only. One can distinguish 
two kinds of such measures: (i) simple 2-data-point measures, and (ii) sophisticated n-data-point 
measures. We begin with two examples belonging to the first category. 
3.2.1 Range

For a raw data set $\{x_i\}_{i=1,\ldots,n}$ of $n$ observed values for $X$, the dimensionful range $R$ (metr) simply
expresses the difference between the largest and the smallest value in this set, i.e.,

\[ R := x_{(n)} - x_{(1)} . \tag{3.10} \]

The basis of this measure is the ordered data set $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$. Alternatively, the range
can be denoted by $R = Q_4 - Q_0$.

SPSS: Analyze → Descriptive Statistics → Frequencies... → Statistics...: Range



3.2.2 Interquartile range 

In the same spirit as the range, the dimensionful interquartile range $d_Q$ (metr) is defined as the
difference between the third quartile and the first quartile of the relative frequency distribution for
some $X$, i.e.,

\[ d_Q := \tilde{x}_{0.75} - \tilde{x}_{0.25} . \tag{3.11} \]

Alternatively, this is $d_Q = Q_3 - Q_1$.

3.2.3 Sample variance 

The most frequently employed measure of variability in Statistics is the dimensionful n-data-point
sample variance $s^2$ (metr), and the related sample standard deviation to be discussed below. Given
a raw data set $\{x_i\}_{i=1,\ldots,n}$ for $X$, its spread is essentially quantified in terms of the sum of squared
deviations of the $n$ data points from their common mean $\bar{x}$. Due to the algebraic identity

\[ (x_1 - \bar{x}) + \ldots + (x_n - \bar{x}) = \sum_{i=1}^{n}(x_i - \bar{x}) = \left(\sum_{i=1}^{n} x_i\right) - n\bar{x} \overset{\text{Eq. (3.6)}}{=} 0 , \]

there are only $n - 1$ degrees of freedom involved in this measure. The sample variance is defined
by:



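(i) From raw data set:

\[ s^2 := \frac{1}{n-1}\left[(x_1 - \bar{x})^2 + \ldots + (x_n - \bar{x})^2\right] =: \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 , \tag{3.12} \]

alternatively, by the shift theorem:²

\[ s^2 = \frac{1}{n-1}\left[x_1^2 + \ldots + x_n^2 - n\bar{x}^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right] \tag{3.13} \]

² That is, the algebraic identity

\[ \sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) \overset{\text{Eq. (3.6)}}{=} \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n}\bar{x}^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 . \]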
(ii) From relative frequency distribution:

\[ s^2 := \frac{n}{n-1}\left[(a_1 - \bar{x})^2 h_n(a_1) + \ldots + (a_k - \bar{x})^2 h_n(a_k)\right] =: \frac{n}{n-1}\sum_{j=1}^{k}(a_j - \bar{x})^2 h_n(a_j) , \tag{3.14} \]

alternatively:

\[ s^2 = \frac{n}{n-1}\left[a_1^2 h_n(a_1) + \ldots + a_k^2 h_n(a_k) - \bar{x}^2\right] = \frac{n}{n-1}\left[\sum_{j=1}^{k} a_j^2 h_n(a_j) - \bar{x}^2\right] \tag{3.15} \]



Remarks: (i) We point out that the alternative formulae for a sample variance provided here prove
computationally more efficient.

(ii) For binned data, when one selects the midpoint of each class interval $K_j$ to represent the
$a_j$ (given the raw data set is no longer accessible), a correction of Eqs. (3.14) and (3.15) by an
additional term $(1/12)\,(n/(n-1))\sum_{j=1}^{k} b_j^2 h_j$ becomes necessary, assuming uniformly distributed
data within each class interval $K_j$ of width $b_j$; cf. Eq. (8.31).

EXCEL: VAR (dt.: VAR.S, VARIANZ)

SPSS: Analyze → Descriptive Statistics → Frequencies... → Statistics...: Variance

R: var(variable)
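The equivalence of Eqs. (3.12), (3.13) and (3.15) can be checked numerically. The sketch below is only an illustration with hypothetical function names and sample values; it uses the standard library's `statistics.variance` (which also divides by n - 1) as an independent cross-check.

```python
import statistics
from collections import Counter


def variance_by_definition(xs):
    """Sample variance via squared deviations from the mean, cf. Eq. (3.12)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_by_shift_theorem(xs):
    """Sample variance via the shift theorem, cf. Eq. (3.13)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)


def variance_from_relative_frequencies(xs):
    """Sample variance from the relative frequency distribution, cf. Eq. (3.15)."""
    n = len(xs)
    mean = sum(xs) / n
    rel_freq = {a: count / n for a, count in Counter(xs).items()}
    return n / (n - 1) * (sum(a * a * h for a, h in rel_freq.items()) - mean ** 2)


# Hypothetical sample, chosen for illustration only
data = [2, 3, 3, 5, 7, 7, 7, 10]
results = [
    variance_by_definition(data),
    variance_by_shift_theorem(data),
    variance_from_relative_frequencies(data),
]
print(results, statistics.variance(data))
```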



3.2.4 Sample standard deviation 

For ease of handling dimensions associated with a metrically scaled 1-D variable $X$, one defines
the dimensionful sample standard deviation $s$ (metr) simply as the positive square root of the
sample variance, i.e.,

\[ s := +\sqrt{s^2} , \tag{3.16} \]

such that a measure for the spread of data results which shares the dimension of $X$ and its
arithmetical mean $\bar{x}$.

EXCEL: STDEV (dt.: STABW.S, STABW)

SPSS: Analyze → Descriptive Statistics → Frequencies... → Statistics...: Std. deviation

R: sd(variable)



3.2.5 Sample coefficient of variation 

For ratio scaled 1-D variables $X$, a dimensionless relative measure of variability is the sample
coefficient of variation $v$ (metr: ratio), defined by

\[ v := \frac{s}{\bar{x}} , \quad\text{if}\quad \bar{x} > 0 . \tag{3.17} \]
3.2.6 Standardisation

Data for a metrically scaled 1-D variable $X$ is amenable to the process of standardisation. By this is
meant a transformation procedure $X \to Z$, which generates from data for a dimensionful $X$, with mean
$\bar{x}$ and sample standard deviation $s_X > 0$, data for an equivalent dimensionless variable $Z$ according
to

\[ x_i \mapsto z_i := \frac{x_i - \bar{x}}{s_X} \quad\text{for all}\quad i = 1, \ldots, n . \tag{3.18} \]

For the resultant $Z$-data, referred to as the z-scores of the original metrical $X$-data, this has the
convenient consequence that the corresponding arithmetical mean and the sample variance amount
to

\[ \bar{z} = 0 \quad\text{and}\quad s_Z^2 = 1 , \]

respectively.
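The two stated properties of z-scores can be verified with a few lines of Python. This is a minimal sketch under the assumption that the sample standard deviation uses the divisor n - 1, as in Eq. (3.12); the function name `z_scores` and the data are hypothetical.

```python
import statistics


def z_scores(xs):
    """Standardise raw data according to Eq. (3.18)."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample standard deviation, divisor n - 1
    if s == 0:
        raise ValueError("standardisation requires s_X > 0")
    return [(x - mean) / s for x in xs]


# Hypothetical sample, chosen for illustration only
data = [2, 3, 3, 5, 7, 7, 7, 10]
z = z_scores(data)
print(statistics.mean(z), statistics.variance(z))
```

Up to floating-point rounding, the printed mean should vanish and the printed variance should equal one.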



3.3 Measures of relative distortion 

The third family of measures characterising relative frequency distributions of data $\{x_i\}_{i=1,\ldots,n}$
for metrically scaled 1-D variables $X$, having specific mean $\bar{x}$ and standard deviation $s_X$, takes
a Gaußian normal distribution with parameter values equal to $\bar{x}$ and $s_X$ as a reference
case (cf. Sec. 8.5 below). With respect to this reference distribution, one defines two kinds of
dimensionless measures of relative distortion as described in the following.



3.3.1 Skewness 

The skewness $g_1$ (metr) is a dimensionless measure to quantify the degree of relative distortion of
a given frequency distribution in the horizontal direction. For $n > 2$,

\[ g_1 := \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_X}\right)^3 , \tag{3.19} \]

wherein the observed values $\{x_i\}_{i=1,\ldots,n}$ enter standardised according to Eq. (3.18). Note that
$g_1 = 0$ for an exact Gaußian normal distribution.

EXCEL: SKEW (dt.: SCHIEFE)

SPSS: Analyze → Descriptive Statistics → Frequencies... → Statistics...: Skewness

R: skewness(variable)



3.3.2 Excess kurtosis 

The excess kurtosis $g_2$ (metr) is a dimensionless measure to quantify the degree of relative
distortion of a given frequency distribution in the vertical direction. For $n > 3$,

\[ g_2 := \frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_X}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)} , \tag{3.20} \]

wherein the observed values $\{x_i\}_{i=1,\ldots,n}$ enter standardised according to Eq. (3.18). Note that
$g_2 = 0$ for an exact Gaußian normal distribution.

EXCEL: KURT (dt.: KURT)

SPSS: Analyze → Descriptive Statistics → Frequencies... → Statistics...: Kurtosis

R: kurtosis(variable)
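The sample-size corrected expressions for skewness and excess kurtosis given above can be sketched in Python as follows. The function names and the test data are hypothetical; the data set is deliberately symmetric about its mean so that the skewness should vanish.

```python
import statistics


def skewness(xs):
    """Skewness g_1 with sample-size correction; requires n > 2."""
    n = len(xs)
    if n < 3:
        raise ValueError("skewness requires n > 2")
    mean, s = statistics.mean(xs), statistics.stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


def excess_kurtosis(xs):
    """Excess kurtosis g_2 with sample-size correction; requires n > 3."""
    n = len(xs)
    if n < 4:
        raise ValueError("excess kurtosis requires n > 3")
    mean, s = statistics.mean(xs), statistics.stdev(xs)
    fourth_moment_sum = sum(((x - mean) / s) ** 4 for x in xs)
    leading = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return leading * fourth_moment_sum - correction


# Hypothetical data, symmetric about the mean 3
symmetric = [1, 2, 3, 4, 5]
print(skewness(symmetric), excess_kurtosis(symmetric))
```

For this small, flat data set the excess kurtosis comes out negative, i.e., the distribution is flatter than the Gaußian reference case.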



3.4 Measures of concentration 

Finally, for data $\{x_i\}_{i=1,\ldots,n}$ relating to a ratio scaled 1-D variable $X$, which has a discrete spectrum
of values $\{a_j\}_{j=1,\ldots,k}$ or was binned into $k$ different categories $\{K_j\}_{j=1,\ldots,k}$ with respective
midpoints $a_j$, two kinds of measures of concentration are commonplace in Statistics; one qualitative
in nature, the other quantitative.

Begin by defining the total sum for the data $\{x_i\}_{i=1,\ldots,n}$ by

\[ S := \sum_{i=1}^{n} x_i = \sum_{j=1}^{k} a_j o_n(a_j) \overset{\text{Eq. (3.6)}}{=} n\bar{x} , \tag{3.21} \]

where $(a_j, o_n(a_j))_{j=1,\ldots,k}$ is the absolute frequency distribution of the observed values (or
categories) of $X$. Then the relative proportion that the value $a_j$ (or the category $K_j$) takes in $S$
is

\[ \frac{a_j o_n(a_j)}{S} = \frac{a_j h_n(a_j)}{\bar{x}} . \tag{3.22} \]



3.4.1 Lorenz curve 

From the elements introduced in Eqs. (3.21) and (3.22), the US-American economist
Max Otto Lorenz (1876-1959) constructed cumulative relative quantities which constitute the
coordinates of a so-called Lorenz curve representing concentration in the distribution of the ratio
scaled 1-D variable $X$. These coordinates are defined as follows:

• Horizontal axis:

\[ k_i := \sum_{j=1}^{i}\frac{o_n(a_j)}{n} = \sum_{j=1}^{i} h_n(a_j) \qquad (i = 1, \ldots, k) , \tag{3.23} \]

• Vertical axis:

\[ l_i := \sum_{j=1}^{i}\frac{a_j o_n(a_j)}{S} = \sum_{j=1}^{i}\frac{a_j h_n(a_j)}{\bar{x}} \qquad (i = 1, \ldots, k) . \tag{3.24} \]

The initial point on a Lorenz curve is generally the coordinate system's origin, $(k_0, l_0) = (0, 0)$, the
final point is $(1, 1)$. As a reference to measure concentration in the distribution of $X$ in qualitative
terms, one defines a null concentration curve as the bisecting line linking $(0, 0)$ to $(1, 1)$. The
Lorenz curve is interpreted as stating that a point on the curve with coordinates $(k_i, l_i)$ represents
the fact that $k_i \times 100\%$ of the $n$ statistical units take a share of $l_i \times 100\%$ in the total sum $S$
for the ratio scaled 1-D variable $X$. Qualitatively, for given data $\{x_i\}_{i=1,\ldots,n}$, the concentration
in the distribution of $X$ is the stronger, the larger is the dip of the Lorenz curve relative to the
null concentration curve. Note that in addition to the null concentration curve, one can define as a
second reference a maximum concentration curve such that only the largest value $a_k$ (or category
$K_k$) in the spectrum of values of $X$ takes the full share of 100% in the total sum $S$ for $\{x_i\}_{i=1,\ldots,n}$.



3.4.2 Normalised Gini coefficient 



The Italian statistician, demographer and sociologist Corrado Gini (1884-1965) devised a
quantitative measure for concentration in the distribution of a ratio scaled 1-D variable $X$. The
dimensionless normalised Gini coefficient $G_+$ (metr: ratio) can be interpreted geometrically as the ratio
of areas

\[ G_+ := \frac{\text{(area enclosed between Lorenz and null concentration curves)}}{\text{(area enclosed between maximum and null concentration curves)}} . \tag{3.25} \]

Its related computational definition is given by

\[ G_+ := \frac{n}{n-1}\left[\sum_{i=1}^{k}(k_{i-1} + k_i)\,\frac{a_i o_n(a_i)}{S} - 1\right] . \tag{3.26} \]

Due to normalisation, the range of values is $0 \leq G_+ \leq 1$. Thus, null concentration amounts to
$G_+ = 0$, while maximum concentration amounts to $G_+ = 1$.



In September 2012 it was reported (implicitly) in the public press that the coordinates underlying the Lorenz curve
describing the distribution of private equity in Germany at the time were (0.00, 0.00), (0.50, 0.01), (0.90, 0.50), and
(1.00, 1.00); cf. Ref. [153]. Given that in this case $n \gg 1$, these values amount to a Gini coefficient of $G_+ = 0.64$.
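The quoted value of the Gini coefficient can be reproduced from the four Lorenz curve coordinates. The sketch below evaluates the bracket of the computational definition by the trapezoidal rule, using the fact that the share of category i in the total sum S equals the increment of the vertical coordinate. The function name is hypothetical, and the sample size n = 1 000 000 is an assumption standing in for a large n, since the actual n is not stated in the text.

```python
def normalised_gini(lorenz_points, n):
    """Normalised Gini coefficient from Lorenz curve coordinates (k_i, l_i).

    lorenz_points must start at (0, 0) and end at (1, 1); n is the sample size.
    """
    weighted_sum = 0.0
    for (k_prev, l_prev), (k_i, l_i) in zip(lorenz_points, lorenz_points[1:]):
        share = l_i - l_prev  # relative proportion of category i in the total sum S
        weighted_sum += (k_prev + k_i) * share
    return n / (n - 1) * (weighted_sum - 1.0)


# Coordinates as quoted in the text; n is an assumed large sample size
points = [(0.00, 0.00), (0.50, 0.01), (0.90, 0.50), (1.00, 1.00)]
gini = normalised_gini(points, n=1_000_000)
print(round(gini, 2))
```

For a Lorenz curve coinciding with the bisecting line the same function returns zero, in line with the notion of null concentration.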
Chapter 4

Descriptive measures of association for bivariate distributions

Now we come to describe and characterise specific features of bivariate frequency distributions,
i.e., intrinsic structures of raw data sets $\{(x_i, y_i)\}_{i=1,\ldots,n}$ obtained from statistical samples $S_\Omega$ for
2-D variables $(X, Y)$ from some population of study objects $\Omega$. Let us suppose that the spectrum
of values resp. categories of $X$ is $a_1, a_2, \ldots, a_k$, and the spectrum of values resp. categories of $Y$
is $b_1, b_2, \ldots, b_l$, where $k, l \in \mathbb{N}$. Hence, for the bivariate joint distribution there exists a total
of $k \times l$ possible combinations $\{(a_i, b_j)\}_{i=1,\ldots,k;\,j=1,\ldots,l}$ of values resp. categories for $(X, Y)$. In
the following we will denote associated absolute (observed) frequencies by $o_{ij} := o_n(a_i, b_j)$, and
relative frequencies by $h_{ij} := h_n(a_i, b_j)$.



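4.1 (k × l) contingency tables

Consider a raw data set $\{(x_i, y_i)\}_{i=1,\ldots,n}$ for a 2-D variable $(X, Y)$ giving rise to $k \times l$ combinations
of values resp. categories $\{(a_i, b_j)\}_{i=1,\ldots,k;\,j=1,\ldots,l}$. The bivariate joint distribution of observed
absolute frequencies $o_{ij}$ may be conveniently represented in terms of a $(k \times l)$ contingency table
(or cross tabulation) by

\[
\begin{array}{c|cccccc|c}
o_{ij} & b_1 & b_2 & \ldots & b_j & \ldots & b_l & \Sigma_j \\ \hline
a_1 & o_{11} & o_{12} & \ldots & o_{1j} & \ldots & o_{1l} & o_{1+} \\
a_2 & o_{21} & o_{22} & \ldots & o_{2j} & \ldots & o_{2l} & o_{2+} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots \\
a_i & o_{i1} & o_{i2} & \ldots & o_{ij} & \ldots & o_{il} & o_{i+} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots \\
a_k & o_{k1} & o_{k2} & \ldots & o_{kj} & \ldots & o_{kl} & o_{k+} \\ \hline
\Sigma_i & o_{+1} & o_{+2} & \ldots & o_{+j} & \ldots & o_{+l} & n
\end{array}
\tag{4.1}
\]

where it holds for all $i = 1, \ldots, k$ and $j = 1, \ldots, l$ that

\[ 0 \leq o_{ij} \leq n \quad\text{and}\quad \sum_{i=1}^{k}\sum_{j=1}^{l} o_{ij} = n . \tag{4.2} \]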
The corresponding marginal absolute frequencies of $X$ and of $Y$ are

\[ o_{i+} := o_{i1} + o_{i2} + \ldots + o_{ij} + \ldots + o_{il} =: \sum_{j=1}^{l} o_{ij} \tag{4.3} \]

\[ o_{+j} := o_{1j} + o_{2j} + \ldots + o_{ij} + \ldots + o_{kj} =: \sum_{i=1}^{k} o_{ij} . \tag{4.4} \]

SPSS: Analyze → Descriptive Statistics → Crosstabs... → Cells...: Observed

One obtains the related bivariate joint distribution of observed relative frequencies $h_{ij}$ following
the systematics of Eq. (2.2) to yield





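\[
\begin{array}{c|cccccc|c}
h_{ij} & b_1 & b_2 & \ldots & b_j & \ldots & b_l & \Sigma_j \\ \hline
a_1 & h_{11} & h_{12} & \ldots & h_{1j} & \ldots & h_{1l} & h_{1+} \\
a_2 & h_{21} & h_{22} & \ldots & h_{2j} & \ldots & h_{2l} & h_{2+} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots \\
a_i & h_{i1} & h_{i2} & \ldots & h_{ij} & \ldots & h_{il} & h_{i+} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \ddots & \vdots & \vdots \\
a_k & h_{k1} & h_{k2} & \ldots & h_{kj} & \ldots & h_{kl} & h_{k+} \\ \hline
\Sigma_i & h_{+1} & h_{+2} & \ldots & h_{+j} & \ldots & h_{+l} & 1
\end{array}
\tag{4.5}
\]

Again, it holds for all $i = 1, \ldots, k$ and $j = 1, \ldots, l$ that

\[ 0 \leq h_{ij} \leq 1 \quad\text{and}\quad \sum_{i=1}^{k}\sum_{j=1}^{l} h_{ij} = 1 , \tag{4.6} \]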



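while the marginal relative frequencies of $X$ and of $Y$ are

\[ h_{i+} := h_{i1} + h_{i2} + \ldots + h_{ij} + \ldots + h_{il} =: \sum_{j=1}^{l} h_{ij} \tag{4.7} \]

\[ h_{+j} := h_{1j} + h_{2j} + \ldots + h_{ij} + \ldots + h_{kj} =: \sum_{i=1}^{k} h_{ij} . \tag{4.8} \]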



On the basis of a $(k \times l)$ contingency table displaying the relative frequencies of the bivariate
joint distribution of some 2-D $(X, Y)$, one may define two kinds of related conditional relative
frequency distributions, namely (i) the conditional distribution of $X$ given $Y$ by

\[ h(a_i|b_j) := \frac{h_{ij}}{h_{+j}} , \tag{4.9} \]

and (ii) the conditional distribution of $Y$ given $X$ by

\[ h(b_j|a_i) := \frac{h_{ij}}{h_{i+}} . \tag{4.10} \]
Then, by means of these conditional distributions, a notion of statistical independence of variables
$X$ and $Y$ is defined to correspond to the simultaneous properties

\[ h(a_i|b_j) = h(a_i) = h_{i+} \quad\text{and}\quad h(b_j|a_i) = h(b_j) = h_{+j} . \tag{4.11} \]

Given these properties hold, it follows from Eqs. (4.9) and (4.10) that

\[ h_{ij} = h_{i+}h_{+j} . \tag{4.12} \]
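The independence condition of Eq. (4.12) lends itself to a direct numerical check on a contingency table of absolute frequencies. The following sketch is illustrative only: the function names, the tolerance and both example tables are hypothetical.

```python
def relative_frequencies(table):
    """Convert a (k x l) table of absolute frequencies o_ij into relative frequencies h_ij."""
    n = sum(sum(row) for row in table)
    return [[o_ij / n for o_ij in row] for row in table]


def is_independent(table, tol=1e-12):
    """Check h_ij = h_i+ * h_+j for all cells, cf. Eq. (4.12)."""
    h = relative_frequencies(table)
    row_marginals = [sum(row) for row in h]        # h_i+, cf. Eq. (4.7)
    col_marginals = [sum(col) for col in zip(*h)]  # h_+j, cf. Eq. (4.8)
    return all(
        abs(h[i][j] - row_marginals[i] * col_marginals[j]) <= tol
        for i in range(len(h))
        for j in range(len(h[0]))
    )


# Hypothetical (2 x 2) tables: proportional rows versus a purely diagonal table
print(is_independent([[10, 20], [30, 60]]), is_independent([[10, 0], [0, 10]]))
```

In real sample data the condition will rarely hold exactly, so the exact check shown here mainly serves to make the definition concrete.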



4.2 Measures of association for the metrical scale level 

Next, specifically consider a raw data set $\{(x_i, y_i)\}_{i=1,\ldots,n}$ from a statistical sample $S_\Omega$ for some
metrically scaled 2-D variable (X, Y). The bivariate joint distribution of (X, Y) in this sample 
can be conveniently represented graphically in terms of a scatter plot. Let us now introduce two 
kinds of measures for the description of specific characteristic features of such distributions. 



4.2.1 Sample covariance 

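The first standard measure characterising bivariate joint distributions of metrically scaled 2-D
$(X, Y)$ descriptively is the dimensionful sample covariance $s_{XY}$ (metr), defined by

(i) From raw data set:

\[ s_{XY} := \frac{1}{n-1}\left[(x_1 - \bar{x})(y_1 - \bar{y}) + \ldots + (x_n - \bar{x})(y_n - \bar{y})\right] =: \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) , \tag{4.13} \]

alternatively:

\[ s_{XY} = \frac{1}{n-1}\left[x_1 y_1 + \ldots + x_n y_n - n\bar{x}\bar{y}\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right] \tag{4.14} \]

(ii) From relative frequency distribution:

\[ s_{XY} := \frac{n}{n-1}\left[(a_1 - \bar{x})(b_1 - \bar{y})h_{11} + \ldots + (a_k - \bar{x})(b_l - \bar{y})h_{kl}\right] =: \frac{n}{n-1}\sum_{i=1}^{k}\sum_{j=1}^{l}(a_i - \bar{x})(b_j - \bar{y})h_{ij} , \tag{4.15} \]

alternatively:

\[ s_{XY} = \frac{n}{n-1}\left[a_1 b_1 h_{11} + \ldots + a_k b_l h_{kl} - \bar{x}\bar{y}\right] = \frac{n}{n-1}\left[\sum_{i=1}^{k}\sum_{j=1}^{l} a_i b_j h_{ij} - \bar{x}\bar{y}\right] \tag{4.16} \]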
Remark: The alternative formulae provided here prove computationally more efficient.

EXCEL: COVAR (dt.: KOVARIANZ.S)

It is worthwhile to point out that in the research literature it is standard to define for bivariate
joint distributions of metrically scaled 2-D $(X, Y)$ a dimensionful symmetric $(2 \times 2)$ covariance
matrix $S$ according to

\[ S := \begin{pmatrix} s_X^2 & s_{XY} \\ s_{XY} & s_Y^2 \end{pmatrix} , \tag{4.17} \]

the components of which are defined by Eqs. (3.12) and (4.13). The determinant of $S$, given by
$\det(S) = s_X^2 s_Y^2 - s_{XY}^2$, is positive as long as $s_X^2 s_Y^2 - s_{XY}^2 > 0$, which applies in most practical
cases. Then $S$ is regular, and thus a corresponding inverse $S^{-1}$ exists; cf. Ref. [12, Sec. 3.5].

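The concept of a regular covariance matrix $S$ and its inverse $S^{-1}$ generalises in a straightforward
fashion to the case of multivariate joint distributions of metrically scaled m-D $(X, Y, \ldots, Z)$,
where $S \in \mathbb{R}^{m \times m}$ is given by

\[ S := \begin{pmatrix} s_X^2 & s_{XY} & \ldots & s_{ZX} \\ s_{XY} & s_Y^2 & \ldots & s_{YZ} \\ \vdots & \vdots & \ddots & \vdots \\ s_{ZX} & s_{YZ} & \ldots & s_Z^2 \end{pmatrix} . \tag{4.18} \]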



4.2.2 Bravais and Pearson's sample correlation coefficient 

The sample covariance $s_{XY}$ constitutes the basis for the second standard measure
characterising bivariate joint distributions of metrically scaled 2-D $(X, Y)$ descriptively,
the normalised dimensionless sample correlation coefficient $r$ (metr) devised by the
French physicist Auguste Bravais (1811-1863) and the English mathematician and statistician
Karl Pearson FRS (1857-1936) for the purpose of analysing corresponding raw data
$\{(x_i, y_i)\}_{i=1,\ldots,n}$ for the existence of linear (!!!) statistical associations. It is defined in terms
of the bivariate sample covariance $s_{XY}$ and the univariate sample standard deviations $s_X$ and $s_Y$
by (cf. Bravais (1846) and Pearson (1901))

\[ r := \frac{s_{XY}}{s_X s_Y} . \tag{4.19} \]



Due to normalisation, the range of the sample correlation coefficient is $-1 \leq r \leq +1$. The sign
of $r$ encodes the direction of a correlation. As to interpreting the strength of a correlation via the
magnitude $|r|$, in practice one typically employs the following qualitative

Rule of thumb:

0.0 = |r|: no correlation

0.0 < |r| < 0.2: very weak correlation

0.2 ≤ |r| < 0.4: weak correlation

0.4 ≤ |r| < 0.6: moderately strong correlation

0.6 ≤ |r| ≤ 0.8: strong correlation

0.8 < |r| < 1.0: very strong correlation

1.0 = |r|: perfect correlation.

EXCEL: CORREL (dt.: KORREL)

SPSS: Analyze → Correlate → Bivariate...: Pearson

R: cor(variable1, variable2, use="complete.obs")
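To make Eqs. (4.13) and (4.19) concrete, the following sketch computes the sample covariance and the sample correlation coefficient for a small bivariate data set. The function names and the data are hypothetical and were chosen so that the result can be checked by hand.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance s_XY from raw data, cf. Eq. (4.13)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(xs, ys):
    """Sample correlation coefficient r = s_XY / (s_X s_Y), cf. Eq. (4.19)."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # s_XX is the sample variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical bivariate sample, chosen for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sample_covariance(x, y), pearson_r(x, y))
```

For these data $s_{XY} = 1.5$, $s_X^2 = 2.5$ and $s_Y^2 = 1.5$, so that $r = \sqrt{0.6} \approx 0.77$, a strong correlation according to the rule of thumb.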

In addition to Eq. (4.17), it is convenient to define a dimensionless symmetric $(2 \times 2)$ correlation
matrix $R$ by

\[ R := \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} , \tag{4.20} \]

which is regular and positive definite as long as $1 - r^2 > 0$. Then its inverse $R^{-1}$ is given by

\[ R^{-1} = \frac{1}{1 - r^2}\begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix} . \tag{4.21} \]



Note that for non-correlating metrically scaled variables $X$ and $Y$, i.e., when $r = 0$, the correlation
matrix degenerates to become a unit matrix, $R = \mathbf{1}$.

Again, the concept of a regular and positive definite correlation matrix $R$, with inverse $R^{-1}$,
generalises to multivariate joint distributions of metrically scaled m-D $(X, Y, \ldots, Z)$, where
$R \in \mathbb{R}^{m \times m}$ is given by³

\[ R := \begin{pmatrix} 1 & r_{XY} & \ldots & r_{ZX} \\ r_{XY} & 1 & \ldots & r_{YZ} \\ \vdots & \vdots & \ddots & \vdots \\ r_{ZX} & r_{YZ} & \ldots & 1 \end{pmatrix} . \tag{4.22} \]

Note that $R$ is a dimensionless quantity which, hence, is scale-invariant; cf. Sec. 8.9.



4.3 Measures of association for the ordinal scale level 

At the ordinal scale level, raw data $\{(x_i, y_i)\}_{i=1,\ldots,n}$ for a 2-D variable $(X, Y)$ is not necessarily
quantitative in nature. Therefore, in order to be in a position to define a sensible quantitative
bivariate measure of statistical association for ordinal variables, one needs to introduce meaningful
surrogate data which is numerical. This task is realised by means of defining so-called ranks,
which are assigned to the original ordinal data according to the procedure described in the following.

Begin by establishing amongst the observed values $\{x_i\}_{i=1,\ldots,n}$ resp. $\{y_i\}_{i=1,\ldots,n}$ their natural
hierarchical order, i.e.,

\[ x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)} \quad\text{and}\quad y_{(1)} \leq y_{(2)} \leq \ldots \leq y_{(n)} . \tag{4.23} \]

³ Given a data matrix $X \in \mathbb{R}^{n \times m}$ for a metrically scaled m-D $(X, Y, \ldots, Z)$, one can show that upon
standardisation of the data according to Eq. (3.18), which amounts to a transformation $X \mapsto Z \in \mathbb{R}^{n \times m}$,
the correlation matrix can be represented by $R = \frac{1}{n-1} Z^T Z$.
Then, every individual $x_i$ resp. $y_i$ is assigned a numerical rank which corresponds to its position
in the ordered sequences (4.23):

\[ x_i \mapsto R(x_i) , \quad y_i \mapsto R(y_i) , \quad\text{for all}\quad i = 1, \ldots, n . \tag{4.24} \]

Should there be any "tied ranks" due to equality of some $x_i$ or $y_i$, one assigns the arithmetical
mean of these ranks to all $x_i$ resp. $y_i$ involved in the "tie". Ultimately, by this procedure the entire
bivariate raw data undergoes a transformation

\[ \{(x_i, y_i)\}_{i=1,\ldots,n} \mapsto \{[R(x_i), R(y_i)]\}_{i=1,\ldots,n} , \tag{4.25} \]

yielding $n$ pairs of ranks to numerically represent the original ordinal data.

Given surrogate rank data, the means of ranks always amount to

\[ \bar{R}(x) := \frac{1}{n}\sum_{i=1}^{n} R(x_i) = \frac{n+1}{2} \tag{4.26} \]

\[ \bar{R}(y) := \frac{1}{n}\sum_{i=1}^{n} R(y_i) = \frac{n+1}{2} . \tag{4.27} \]



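The variances of ranks are defined in accordance with Eqs. (3.13) and (3.15), i.e.,

\[ s_{R(x)}^2 := \frac{1}{n-1}\left[\sum_{i=1}^{n} R^2(x_i) - n\bar{R}^2(x)\right] = \frac{n}{n-1}\left[\sum_{i=1}^{k} R^2(a_i)h_{i+} - \bar{R}^2(x)\right] \tag{4.28} \]

\[ s_{R(y)}^2 := \frac{1}{n-1}\left[\sum_{i=1}^{n} R^2(y_i) - n\bar{R}^2(y)\right] = \frac{n}{n-1}\left[\sum_{j=1}^{l} R^2(b_j)h_{+j} - \bar{R}^2(y)\right] . \tag{4.29} \]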



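In addition, to characterise the joint distribution of ranks, a covariance of ranks is defined in line
with Eqs. (4.14) and (4.16) by

\[ s_{R(x)R(y)} := \frac{1}{n-1}\left[\sum_{i=1}^{n} R(x_i)R(y_i) - n\bar{R}(x)\bar{R}(y)\right] = \frac{n}{n-1}\left[\sum_{i=1}^{k}\sum_{j=1}^{l} R(a_i)R(b_j)h_{ij} - \bar{R}(x)\bar{R}(y)\right] . \tag{4.30} \]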



Against this fairly elaborate technical backdrop, the English psychologist and statistician
Charles Edward Spearman FRS (1863-1945) defined a dimensionless rank correlation coefficient
$r_S$ (ord), in analogy to Eq. (4.19), by (cf. Spearman (1904))

\[ r_S := \frac{s_{R(x)R(y)}}{s_{R(x)}s_{R(y)}} . \tag{4.31} \]



The range of this rank correlation coefficient is $-1 \leq r_S \leq +1$. Again, while the sign of $r_S$
encodes the direction of a rank correlation, in interpreting the strength of a rank correlation via
the magnitude $|r_S|$ one usually employs the qualitative

Rule of thumb:

0.0 = |r_S|: no rank correlation

0.0 < |r_S| < 0.2: very weak rank correlation

0.2 ≤ |r_S| < 0.4: weak rank correlation

0.4 ≤ |r_S| < 0.6: moderately strong rank correlation

0.6 ≤ |r_S| ≤ 0.8: strong rank correlation

0.8 < |r_S| < 1.0: very strong rank correlation

1.0 = |r_S|: perfect rank correlation.

SPSS: Analyze → Correlate → Bivariate...: Spearman

When no tied ranks occur, Eq. (4.31) simplifies to

\[ r_S = 1 - \frac{6\sum_{i=1}^{n}\left[R(x_i) - R(y_i)\right]^2}{n(n^2 - 1)} . \tag{4.32} \]
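The ranking procedure with tied ranks and the two expressions for Spearman's rank correlation coefficient can be sketched as follows. All function names and data are hypothetical; the general formula is implemented as the correlation coefficient of Eq. (4.19) applied to the rank data, and the shortcut formula is only valid in the absence of ties.

```python
import math


def average_ranks(values):
    """Assign ranks 1..n; tied values share the arithmetical mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def correlation(us, vs):
    """Correlation coefficient of two equally long numerical sequences."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    cov = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    var_u = sum((u - mean_u) ** 2 for u in us)
    var_v = sum((v - mean_v) ** 2 for v in vs)
    return cov / math.sqrt(var_u * var_v)  # the factors 1/(n - 1) cancel


def spearman_general(xs, ys):
    """Rank correlation coefficient via ranks, valid with or without ties."""
    return correlation(average_ranks(xs), average_ranks(ys))


def spearman_no_ties(xs, ys):
    """Shortcut formula, valid only when no tied ranks occur."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical ordinal data without ties
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(spearman_general(x, y), spearman_no_ties(x, y))
```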



4.4 Measures of association for the nominal scale level 

Lastly, let us turn to consider the case of quantifying descriptively the degree of statistical
association in raw data $\{(x_i, y_i)\}_{i=1,\ldots,n}$ for a nominally scaled 2-D variable $(X, Y)$ with categories
$\{(a_i, b_j)\}_{i=1,\ldots,k;\,j=1,\ldots,l}$. The starting point is provided by the observed absolute resp. relative (cell)
frequencies $o_{ij}$ and $h_{ij}$ of the bivariate joint distribution of $(X, Y)$, with marginal frequencies $o_{i+}$ resp. $h_{i+}$
for $X$ and $o_{+j}$ resp. $h_{+j}$ for $Y$. The $\chi^2$-statistic devised by the English mathematical statistician
Karl Pearson FRS (1857-1936) rests on the notion of statistical independence of two variables $X$
and $Y$ in that it takes the corresponding formal condition provided by Eq. (4.12) as a reference. A
simple algebraic manipulation of this condition obtains

\[ h_{ij} = h_{i+}h_{+j} \quad\Rightarrow\quad \frac{o_{ij}}{n} = \frac{o_{i+}}{n}\,\frac{o_{+j}}{n} \quad\overset{\text{multiplication by } n}{\Longrightarrow}\quad o_{ij} = \frac{o_{i+}o_{+j}}{n} . \tag{4.33} \]

Pearson's descriptive $\chi^2$-statistic (cf. Pearson (1900)) is then defined by

\[ \chi^2 := \sum_{i=1}^{k}\sum_{j=1}^{l}\frac{\left(o_{ij} - \frac{o_{i+}o_{+j}}{n}\right)^2}{\frac{o_{i+}o_{+j}}{n}} = n\sum_{i=1}^{k}\sum_{j=1}^{l}\frac{(h_{ij} - h_{i+}h_{+j})^2}{h_{i+}h_{+j}} , \tag{4.34} \]

whose range of values amounts to $0 \leq \chi^2 \leq \max(\chi^2)$, with $\max(\chi^2) := n\,[\min(k, l) - 1]$.

Remark: Provided $\frac{o_{i+}o_{+j}}{n} \geq 5$ for all $i = 1, \ldots, k$ and $j = 1, \ldots, l$, Pearson's $\chi^2$-statistic can
be employed for the analysis of statistical associations for 2-D variables $(X, Y)$ of almost all
combinations of scale levels.

The problem with Pearson's $\chi^2$-statistic is that, due to its variable spectrum of values, it is not
clear how to interpret the strength of statistical associations. This shortcoming can, however, be
overcome by resorting to the measure of association proposed by the Swedish mathematician,
actuary, and statistician Harald Cramér (1893-1985), which basically is the result of a special kind
of normalisation of Pearson's measure. Thus, Cramér's $V$, as it has come to be known, is defined
by (cf. Cramér (1946))

\[ V := \sqrt{\frac{\chi^2}{\max(\chi^2)}} , \tag{4.35} \]

with range $0 \leq V \leq 1$. For the interpretation of results, one may now employ the qualitative

Rule of thumb:

0.0 ≤ V < 0.2: weak association

0.2 ≤ V < 0.6: moderately strong association

0.6 ≤ V ≤ 1.0: strong association.

SPSS: Analyze → Descriptive Statistics → Crosstabs... → Statistics...: Chi-square, Phi and
Cramer's V
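Pearson's descriptive $\chi^2$-statistic and its normalisation to Cramér's $V$ can be sketched in a few lines. The sketch assumes a table of absolute frequencies in which every row and column total is positive (otherwise an expected frequency vanishes and the division fails); the function names and both example tables are hypothetical.

```python
import math


def chi_square(table):
    """Pearson's descriptive chi-square statistic for a (k x l) table of absolute frequencies."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]        # o_i+
    col_totals = [sum(col) for col in zip(*table)]  # o_+j
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o_ij in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # value under independence
            chi2 += (o_ij - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramér's V: chi-square normalised by its maximum n * (min(k, l) - 1)."""
    n = sum(sum(row) for row in table)
    k, l = len(table), len(table[0])
    return math.sqrt(chi_square(table) / (n * (min(k, l) - 1)))


# Hypothetical (2 x 2) tables: perfect association versus exact independence
print(cramers_v([[10, 0], [0, 10]]), cramers_v([[10, 20], [30, 60]]))
```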



Chapter 5 

Descriptive linear regression analysis 



For strongly correlating sample data $\{(x_i, y_i)\}_{i=1,\ldots,n}$ for some metrically scaled 2-D variable
$(X, Y)$, i.e., when $0.6 \leq |r| \leq 1.0$, it is meaningful to construct a mathematical model of the linear
quantitative statistical association so diagnosed. The standard method to realise such a model is
due to the German mathematician and astronomer Carl Friedrich Gauß (1777-1855) and is known
by the name of descriptive linear regression analysis; cf. Gauß (1809) [17]. We here restrict our
attention to the case of simple linear regression which involves data for two variables only.

To be determined is a best-fit linear model to given bivariate metrical data $\{(x_i, y_i)\}_{i=1,\ldots,n}$. The
linear model in question can be expressed in mathematical terms by

\[ \hat{y} = a + bx , \tag{5.1} \]

with unknown regression coefficients y-intercept $a$ and slope $b$. Gauß' method works as follows.



5.1 Method of least squares 

At first, one has to make a choice: assign $X$ the status of an independent variable, and $Y$ the
status of a dependent variable (or vice versa; usually this freedom of choice does exist, unless
one is testing a specific functional relationship $y = f(x)$). Then, considering the measured values
$x_i$ for $X$ as fixed, to be minimised for the $Y$-data is the sum of the squared vertical deviations
of the measured values $y_i$ from the model values $\hat{y}_i = a + bx_i$ associated with an arbitrary straight
line through the cloud of data points $\{(x_i, y_i)\}_{i=1,\ldots,n}$ in a scatter plot, i.e., the sum

\[ S(a, b) := \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - bx_i)^2 . \tag{5.2} \]

$S(a, b)$ constitutes a (non-negative) real-valued function of two variables $a$ and $b$. Hence,
determining its (local) minimum values entails satisfying (i) the necessary condition of simultaneously
vanishing first partial derivatives

\[ 0 \overset{!}{=} \frac{\partial S(a, b)}{\partial a} , \qquad 0 \overset{!}{=} \frac{\partial S(a, b)}{\partial b} \tag{5.3} \]
(this yields a well-determined $(2 \times 2)$ system of linear equations for the unknowns $a$ and $b$, cf.
Ref. [12, Sec. 3.1]), and (ii) the sufficient condition of a positive definite Hessian matrix $H(a, b)$
of second partial derivatives

\[ H(a, b) := \begin{pmatrix} \dfrac{\partial^2 S(a, b)}{\partial a^2} & \dfrac{\partial^2 S(a, b)}{\partial a\,\partial b} \\[2ex] \dfrac{\partial^2 S(a, b)}{\partial b\,\partial a} & \dfrac{\partial^2 S(a, b)}{\partial b^2} \end{pmatrix} . \tag{5.4} \]

$H(a, b)$ is referred to as positive definite when all of its eigenvalues are positive; cf. Ref. [12,
Sec. 3.6].



5.2 Empirical regression line 



It is a fairly straightforward algebraic exercise (see, e.g., Toutenburg (2004), p 141ff) to show
that the values of the unknowns $a$ and $b$ which determine a unique global minimum of $S(a, b)$
amount to

\[ b = \frac{s_Y}{s_X}\,r , \qquad a = \bar{y} - b\bar{x} . \tag{5.5} \]

These values are referred to as the least squares estimators for $a$ and $b$. Note that they are
exclusively expressible in terms of familiar univariate and bivariate measures characterising the joint
distribution of $X$ and $Y$.

With the solutions $a$ and $b$ of Eq. (5.5), the resultant best-fit linear model is thus

\[ \hat{y} = \bar{y} + \frac{s_Y}{s_X}\,r\,(x - \bar{x}) . \tag{5.6} \]

It may be employed for the purpose of generating interpolating predictions of the kind $x \mapsto \hat{y}$ for
$x$-values confined to the interval $[x_{(1)}, x_{(n)}]$.

EXCEL: SLOPE, INTERCEPT (dt.: STEIGUNG, ACHSENABSCHNITT)

SPSS: Analyze → Regression → Linear...
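The least squares estimators of Eq. (5.5) can be evaluated directly from the univariate and bivariate measures introduced earlier. The sketch below is a minimal illustration with a hypothetical function name and hypothetical data; it relies on `statistics.stdev`, which uses the divisor n - 1.

```python
import statistics


def fit_line(xs, ys):
    """Least squares estimators: b = (s_Y / s_X) * r and a = mean(y) - b * mean(x)."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    r = s_xy / (s_x * s_y)  # sample correlation coefficient
    b = s_y / s_x * r       # slope
    a = mean_y - b * mean_x  # y-intercept
    return a, b


# Hypothetical bivariate sample, chosen for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
print(a, b)
```

For these data the fitted model is approximately $\hat{y} = 2.2 + 0.6\,x$, which should only be used for predictions with $x$ between 1 and 5.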



5.3 Coefficient of determination 

The quality of any particular simple linear regression model, its goodness-of-the-fit, can be
quantified by means of the coefficient of determination $B$ (metr). This measure is derived starting
from the algebraic identity

\[ \sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 , \tag{5.7} \]
which, upon conveniently re-arranging, leads to defining a quantity

\[ B := \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} , \tag{5.8} \]

with range $0 \leq B \leq 1$. For a perfect fit $B = 1$, while for no fit $B = 0$. The coefficient of
determination provides a descriptive measure for the proportion of variability of $Y$ in a data set
$\{(x_i, y_i)\}_{i=1,\ldots,n}$ that can be accounted for as due to the association with $X$ via the simple linear
regression model. Note that in simple linear regression it holds that

\[ B = r^2 ; \tag{5.9} \]

see, e.g., Toutenburg (2004), p 150f.

EXCEL: RSQ (dt.: BESTIMMTHEITSMASS)

SPSS: Analyze → Regression → Linear...

This concludes Part I of these lecture notes: the discussion on descriptive statistical methods of
data analysis. To set the stage for the application of inferential statistical methods in Part III, we
now turn to review the elementary concepts underlying probability theory.
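Returning briefly to the coefficient of determination of Sec. 5.3, the relation $B = r^2$ for simple linear regression can be checked numerically. The sketch below is only an illustration: the function name and the data are hypothetical, and the slope is computed as the ratio of the sum of cross-products to the sum of squares of $X$, which is algebraically the same as $(s_Y/s_X)\,r$.

```python
def determination_and_r_squared(xs, ys):
    """Return (B, r^2) for a simple linear regression of Y on X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)  # total variability of Y
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = ss_xy / ss_xx          # slope, equal to (s_Y / s_X) * r
    a = mean_y - b * mean_x    # y-intercept
    residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    big_b = (ss_yy - residual) / ss_yy      # first form of the definition of B
    r_squared = ss_xy ** 2 / (ss_xx * ss_yy)
    return big_b, r_squared


# Hypothetical bivariate sample, chosen for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
big_b, r_squared = determination_and_r_squared(x, y)
print(big_b, r_squared)
```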
Chapter 6

Elements of probability theory

All examples of inferential statistical methods of data analysis to be presented in Chs. 11 and
12 have been developed in the context of the so-called frequentist approach to probability theory.
The frequentist approach was pioneered by the French lawyer and amateur mathematician
Pierre de Fermat (1601-1665), the French mathematician, physicist, inventor, writer and Catholic
philosopher Blaise Pascal (1623-1662), the Swiss mathematician Jacob Bernoulli (1654-1705),
and the French mathematician and astronomer Marquis Pierre Simon de Laplace (1749-1827). It
is deeply rooted in the assumption that any particular random experiment can be repeated arbitrarily
often under the "same conditions" and completely "independently of one another", so that a
theoretical basis for defining "objective probabilities" of random events is given via the relative
frequencies of very long sequences of repetition of the same random experiment.¹ This is a highly
idealised viewpoint, however, which shares only a limited amount of similarity with the actual
conditions pertaining to an experimenter's reality.

Not everyone in Statistics is entirely happy, though, with the philosophy underlying the frequentist
approach to introducing the concept of probability. A complementary viewpoint is taken by the
framework which originated from the work of the English mathematician and Presbyterian minister
Thomas Bayes (1702-1761), and later of Laplace, and so is commonly referred to as the
Bayes-Laplace approach; cf. Bayes (1763) and Laplace (1812). A striking qualitative difference
to the frequentist approach consists in the use of prior "subjective probabilities" for random events
representing a person's individual degree-of-belief in their likelihood, which are subsequently
updated by analysing relevant empirical sample data. A discussion of the pros and cons of both
approaches to probability theory can be found in, e.g., Sivia and Skilling (2006), p 8ff, or
Gilboa (2009), Sec. 5.3.

In the following we turn to discuss the general principles on which probability theory is built.

6.1 Random events

We begin by introducing some basic formal constructions:

¹ A special role in the context of the frequentist approach to probability theory is assumed by Jacob Bernoulli's law
of large numbers, as well as the concept of independently and identically distributed random variables; we will discuss
these issues in Sec. 8.13 below.
• Random experiments: Random experiments are experiments which can be repeated arbitrarily
often under identical conditions, with events (or outcomes) that cannot be predicted
with certainty. Well-known simple examples are found amongst games of chance such as
rolling dice or playing roulette.

• Sample space $\Omega = \{\omega_1, \omega_2, \ldots\}$: The sample space associated with a random experiment
is constituted by the set of all possible elementary events (or elementary outcomes)
$\omega_i$ $(i = 1, 2, \ldots)$, signified by their property of mutual exclusivity. The sample space $\Omega$ of a
random experiment may contain either

(i) a finite number $n$ of elementary events; then $|\Omega| = n$, or

(ii) countably many elementary events in the sense of a one-to-one correspondence with
the set of natural numbers $\mathbb{N}$, or

(iii) uncountably many elementary events in the sense of a one-to-one correspondence with the
set of real numbers $\mathbb{R}$, or an open or closed subset thereof.

• Random events $A, B, \ldots \subseteq \Omega$: Random events are formally defined as any kind of subsets
of $\Omega$ that can be formed from the elementary events $\omega_i \in \Omega$.

• Certain event $\Omega$: The certain event is synonymous with the sample space. When a random
experiment is conducted "something will happen for sure."

• Impossible event $\emptyset = \{\} = \bar{\Omega}$: The impossible event is the natural complement to the
certain event. When a random experiment is conducted "it is not possible that nothing will
happen at all."

• Event space $Z(\Omega) := \{A \,|\, A \subseteq \Omega\}$: The event space, also referred to as the power set of
$\Omega$, is the set of all possible subsets (random events!) that can be formed from elementary
events $\omega_i \in \Omega$. Its size (or cardinality) is given by $|Z(\Omega)| = 2^{|\Omega|}$. The event space $Z(\Omega)$
constitutes a so-called σ-algebra associated with the sample space $\Omega$; cf. Rinne (2008),
p 177. When $|\Omega| = n$, i.e., when $\Omega$ is finite, then $|Z(\Omega)| = 2^n$.

In the formulation of probability theoretical laws and computational rules, the following set oper- 
ations and identities prove useful. 

Set operations 

1. $\bar{A} = \Omega \setminus A$: complementation of set (or event) $A$ ("not $A$") 

2. $A \setminus B = A \cap \bar{B}$: formation of the difference of sets (or events) $A$ and $B$ ("$A$, but not $B$") 

3. $A \cup B$: formation of the union of sets (or events) $A$ and $B$ ("$A$ or $B$") 

4. $A \cap B$: formation of the intersection of sets (or events) $A$ and $B$ ("$A$ and $B$") 

Computational rules and identities 

1. $A \cup B = B \cup A$ and $A \cap B = B \cap A$ (commutativity) 

2. $(A \cup B) \cup C = A \cup (B \cup C)$ and 
$(A \cap B) \cap C = A \cap (B \cap C)$ (associativity) 

3. $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$ and 
$(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$ (distributivity) 

4. $\overline{A \cup B} = \bar{A} \cap \bar{B}$ and $\overline{A \cap B} = \bar{A} \cup \bar{B}$ (de Morgan's laws) 
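The set operations and identities listed above can be checked mechanically on a small finite sample space. The following is a minimal sketch; the sample space (one roll of a six-sided die) and the events `A`, `B`, `C` are assumptions chosen purely for illustration.

```python
# Assumed example: sample space of one roll of a six-sided die.
omega = frozenset(range(1, 7))
A = frozenset({1, 2, 3})
B = frozenset({2, 4, 6})
C = frozenset({3, 4, 5})


def complement(event, space=omega):
    """Complement of an event relative to the sample space ("not A")."""
    return space - event


# Difference of sets: A \ B equals A intersected with the complement of B.
difference_ok = (A - B) == (A & complement(B))

# Distributivity, in both directions.
distributivity_ok = (
    ((A | B) & C) == ((A & C) | (B & C))
    and ((A & B) | C) == ((A | C) & (B | C))
)

# de Morgan's laws.
de_morgan_ok = (
    complement(A | B) == (complement(A) & complement(B))
    and complement(A & B) == (complement(A) | complement(B))
)

print(difference_ok, distributivity_ok, de_morgan_ok)
```

A check on one example is of course no proof; it merely helps to catch a misread identity.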

Before addressing the central issue of probability theory, we first provide the following important 

Def.: Suppose given that the sample space $\Omega$ of some random experiment is compact. Then one 
understands by a finite complete partition of $\Omega$ a set of $n \in \mathbb{N}$ random events $\{A_1, \ldots, A_n\}$ such 
that 

(i) $A_i \cap A_j = \emptyset$ for $i \neq j$, i.e., they are pairwise disjoint, and 

(ii) $\bigcup_{i=1}^{n} A_i = \Omega$, i.e., their union is identical to the sample space. 



6.2 Kolmogorov's axioms of probability theory 

It took a fairly long time until, by 1933, a unanimously accepted basis of probability theory 
was established. In part the delay was due to problems with providing a unique definition of 
probability and how it could be measured and interpreted in practice. The situation was resolved 
only when the Russian mathematician Andrey Nikolaevich Kolmogorov (1903-1987) proposed to 
discard the intention of providing a unique definition of probability altogether, but restrict the 
issue instead to merely prescribing in an axiomatic fashion a minimum set of essential properties 
any probability measure needs to have in order to be coherent. We now recapitulate the axioms 
that Kolmogorov put forward; cf. Kolmogoroff (1933) [24]. 

For a given random experiment, let $\Omega$ be its sample space and $Z(\Omega)$ the associated event space. 
Then a mapping 

$$ P: Z(\Omega) \rightarrow \mathbb{R}_{\geq 0} \tag{6.1} $$

defines a probability measure with the following properties: 

1. for all random events $A \in Z(\Omega)$, (non-negativity) 

$$ P(A) \geq 0 \,, \tag{6.2} $$

2. for the certain event $\Omega \in Z(\Omega)$, (normalisability) 

$$ P(\Omega) = 1 \,, \tag{6.3} $$

3. for all pairwise disjoint random events $A_1, A_2, \ldots \in Z(\Omega)$, i.e., $A_i \cap A_j = \emptyset$ for all $i \neq j$, 
($\sigma$-additivity) 

$$ P\left(\bigcup_{i=1}^{\infty} A_i\right) = P(A_1 \cup A_2 \cup \ldots) = P(A_1) + P(A_2) + \ldots = \sum_{i=1}^{\infty} P(A_i) \,. \tag{6.4} $$

The first two axioms imply the property 

$$ 0 \leq P(A) \leq 1 \,, \quad \text{for all } A \in Z(\Omega) \,. \tag{6.5} $$

A less strict version of the third axiom is given by requiring only finite additivity of a probability 
measure. This means it shall possess the property 

$$ P(A_1 \cup A_2) = P(A_1) + P(A_2) \,, \quad \text{for any two } A_1, A_2 \in Z(\Omega) \text{ with } A_1 \cap A_2 = \emptyset \,. \tag{6.6} $$

The following consequences for random events $A, B, A_1, A_2, \ldots \in Z(\Omega)$ can be derived from 
Kolmogorov's three axioms of probability theory; cf., e.g., Toutenburg (2005) [57, p 19ff]: 

Consequences 

1. $P(\bar{A}) = 1 - P(A)$ 

2. $P(\emptyset) = P(\bar{\Omega}) = 0$ 

3. $P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)$, occasionally referred to as convexity of a 
probability measure; cf. Gilboa (2009) [18, p 160]. 

4. If $A \subseteq B$, then $P(A) \leq P(B)$. 

5. $P(B) = \sum_{i=1}^{n} P(B \cap A_i)$, provided the $n \in \mathbb{N}$ random events $A_i$ constitute a finite complete 
partition of the sample space $\Omega$. 

6. $P(A \setminus B) = P(A) - P(A \cap B)$. 
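The consequences listed above can be verified numerically for any concrete probability measure on a finite sample space. The sketch below uses exact rational arithmetic; the four elementary events and their weights (a hypothetical loaded four-sided die) are assumptions made only for the purpose of illustration.

```python
from fractions import Fraction

# Assumed toy measure: a loaded four-sided die with hypothetical weights.
weights = {
    1: Fraction(1, 2),
    2: Fraction(1, 4),
    3: Fraction(1, 8),
    4: Fraction(1, 8),
}
omega = frozenset(weights)


def prob(event):
    """Probability of an event as the sum over its elementary events."""
    return sum((weights[w] for w in event), Fraction(0))


A = frozenset({1, 2})
B = frozenset({2, 3})
partition = [frozenset({1}), frozenset({2, 3}), frozenset({4})]

complement_rule = prob(omega - A) == 1 - prob(A)                  # consequence 1
impossible_event = prob(frozenset()) == 0                         # consequence 2
union_rule = prob(A | B) == prob(A) + prob(B) - prob(A & B)       # consequence 3
monotonicity = prob(frozenset({2})) <= prob(B)                    # consequence 4
partition_rule = prob(B) == sum(prob(B & part) for part in partition)  # consequence 5
difference_rule = prob(A - B) == prob(A) - prob(A & B)            # consequence 6

print(prob(omega), prob(A | B))
```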



6.3 Laplacian random experiments 

Games of chance with a finite number $n$ of possible mutually exclusive elementary outcomes, 
such as flipping a single coin once, rolling a single die once, or selecting a single playing 
card from a deck of 32, belong to the simplest kinds of random experiments. In this 
context, there exists a frequentists' notion of a unique "objective probability" associated with 
any single possible random event (outcome) that may occur. Such probabilities can be computed 
according to a straightforward procedure due to the French mathematician and astronomer 
Marquis Pierre Simon de Laplace (1749-1827). The procedure rests on the assumption that the 
device generating the random events is a "fair" one. 

Consider a random experiment, the $n$ elementary events $\omega_i$ ($i = 1, \ldots, n$) of which are "equally 
likely," meaning they are assigned equal probability: 

$$ P(\omega_i) = \frac{1}{|\Omega|} = \frac{1}{n} \,, \quad \text{for all } \omega_i \in \Omega \ (i = 1, \ldots, n) \,. \tag{6.7} $$

Random experiments of this nature are referred to as Laplacian random experiments. 

Def.: For a Laplacian random experiment, the probability of an arbitrary random event $A \in Z(\Omega)$ 
can be computed according to the rule 

$$ P(A) := \frac{|A|}{|\Omega|} = \frac{\text{Number of cases favourable to } A}{\text{Number of all possible cases}} \,. \tag{6.8} $$

The probability measure $P$ here is called a Laplacian probability measure. 
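As a brief illustration of the counting rule in Eq. (6.8), the following sketch enumerates the sample space of rolling two fair dice and counts the cases favourable to the event "the sum of the two faces equals 7". The choice of experiment and event is an assumption made for illustration only.

```python
from fractions import Fraction
from itertools import product

# Sample space of an assumed Laplacian experiment: two fair six-sided dice.
omega = list(product(range(1, 7), repeat=2))


def laplace_probability(event):
    """Number of favourable cases divided by the number of all possible cases."""
    return Fraction(len(event), len(omega))


sum_is_seven = [outcome for outcome in omega if sum(outcome) == 7]
p_seven = laplace_probability(sum_is_seven)

print(len(sum_is_seven), len(omega), p_seven)  # 6 36 1/6
```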

The systematic counting of the possible outcomes of random experiments in general is the central 
theme of combinatorics. We now briefly address its main considerations. 

6.4 Combinatorics 

At the heart of combinatorial considerations is the well-known urn model. This supposes given 
an urn containing $N \in \mathbb{N}$ balls that are either 

(a) all different and thus can be uniquely distinguished from one another, or 

(b) there are $s \in \mathbb{N}$ ($s \leq N$) subsets of indistinguishable like balls, of sizes $n_1, \ldots, n_s$ resp., 
such that $n_1 + \ldots + n_s = N$. 



6.4.1 Permutations 

Permutations relate to the number of distinguishable possibilities of arranging $N$ balls in an 
ordered sequence. Altogether, for cases (a) resp. (b) one finds that there are a total number of 

(a) all balls different:        $N!$ 
(b) $s$ subsets of like balls:    $\dfrac{N!}{n_1! \, n_2! \cdots n_s!}$ 

different possibilities. The factorial of a natural number $N \in \mathbb{N}$ is defined by 

$$ N! := N \times (N - 1) \times (N - 2) \times \cdots \times 3 \times 2 \times 1 \,. \tag{6.9} $$
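The two permutation counts can be compared against a brute-force enumeration. The sketch below treats the letters of the word "BANANA" as an urn with three subsets of like balls (sizes 1, 3 and 2); the helper name `count_arrangements` and the example word are assumptions introduced for illustration.

```python
from itertools import permutations
from math import factorial


def count_arrangements(group_sizes):
    """N! / (n_1! n_2! ... n_s!) for subsets of like balls of the given sizes."""
    total = factorial(sum(group_sizes))
    for size in group_sizes:
        total //= factorial(size)
    return total


# Case (a): four distinguishable balls.
all_different = count_arrangements([1, 1, 1, 1])

# Case (b): letters of "BANANA", i.e. one B, three A, two N.
word = "BANANA"
by_formula = count_arrangements([1, 3, 2])
by_enumeration = len(set(permutations(word)))

print(all_different, by_formula, by_enumeration)  # 24 60 60
```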
6.4.2 Combinations and variations 

Combinations and variations ask for the total number of distinguishable possibilities of selecting 
from a collection of $N$ balls a sample of size $n \leq N$, while differentiating between cases when 

(a) the order in which balls were selected is either neglected or instead accounted for, and 

(b) a ball that was selected once either cannot be selected again or indeed can be selected again 
as often as possible. 

These considerations result in the following cases of 

                                     no repetition           with repetition 
combinations (order neglected)       $\binom{N}{n}$          $\binom{N+n-1}{n}$ 
variations (order accounted for)     $\binom{N}{n}\, n!$     $N^n$ 

different possibilities. Remember, herein, that the binomial coefficient for natural numbers $n, N \in \mathbb{N}$, 
$n \leq N$ is defined by 

$$ \binom{N}{n} := \frac{N!}{n! \, (N-n)!} \,, \tag{6.10} $$

and satisfies the identity 

$$ \binom{N}{n} = \binom{N}{N-n} \,. \tag{6.11} $$
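The four counting formulas can be cross-checked by explicit enumeration for small values. The following sketch uses $N = 5$ and $n = 3$, which are assumed values chosen only to keep the enumeration small.

```python
from itertools import (
    combinations,
    combinations_with_replacement,
    permutations,
    product,
)
from math import comb, perm

N, n = 5, 3          # assumed urn size and sample size
balls = range(N)

# Each entry pairs the closed formula with a brute-force enumeration count.
counts = {
    "combinations, no repetition": (
        comb(N, n), len(list(combinations(balls, n)))),
    "combinations, with repetition": (
        comb(N + n - 1, n), len(list(combinations_with_replacement(balls, n)))),
    "variations, no repetition": (
        perm(N, n), len(list(permutations(balls, n)))),  # perm(N, n) = C(N, n) * n!
    "variations, with repetition": (
        N ** n, len(list(product(balls, repeat=n)))),
}

for case, (formula, enumerated) in counts.items():
    print(f"{case}: formula = {formula}, enumerated = {enumerated}")
```

For these values the four cases give 10, 35, 60 and 125 possibilities, respectively.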



To conclude this chapter, we turn to discuss the important concept of conditional probabilities of 
random events. 



6.5 Conditional probabilities 

Consider some random experiment with sample space $\Omega$, event space $Z(\Omega)$, and a well-defined, 
unique probability measure $P$. 

Def.: For random events $A, B \in Z(\Omega)$, with $P(B) > 0$, 

$$ P(A|B) := \frac{P(A \cap B)}{P(B)} \tag{6.12} $$

defines the conditional probability of $A$ to happen, given that $B$ happened before. Analogously, 
one defines a conditional probability $P(B|A)$ with the roles of random events $A$ and $B$ switched, 
provided $P(A) > 0$. 

Def.: Random events $A, B \in Z(\Omega)$ are called mutually stochastically independent, if, 
simultaneously, the conditions 

$$ P(A|B) = P(A) \,, \ P(B|A) = P(B) \quad \overset{\text{Eq. (6.12)}}{\Longleftrightarrow} \quad P(A \cap B) = P(A) P(B) \tag{6.13} $$

are satisfied, i.e., when for both random events $A$ and $B$ the a posteriori probabilities $P(A|B)$ 
and $P(B|A)$ coincide with the respective a priori probabilities $P(A)$ and $P(B)$. 

For applications, the following two prominent laws of probability theory prove essential. 

6.5.1 Law of total probability 

By the law of total probability it holds for any random event $B \in Z(\Omega)$ that 

$$ P(B) = \sum_{i=1}^{m} P(B|A_i) P(A_i) \,, \tag{6.14} $$

provided the random events $A_1, \ldots, A_m \in Z(\Omega)$ constitute a finite complete partition of $\Omega$ into 
$m \in \mathbb{N}$ pairwise disjoint events. 




6.5.2 Bayes' theorem 

This important result is due to the English mathematician and Presbyterian minister 
Thomas Bayes (1702-1761); cf. Bayes (1763). It states that for any random event $B \in Z(\Omega)$ 

$$ P(A_i|B) = \frac{P(B|A_i) P(A_i)}{\sum_{j=1}^{m} P(B|A_j) P(A_j)} \,, \tag{6.15} $$

provided the random events $A_1, \ldots, A_m \in Z(\Omega)$ constitute a finite complete partition of $\Omega$ into 
$m \in \mathbb{N}$ pairwise disjoint events, and $P(B) = \sum_{i=1}^{m} P(B|A_i) P(A_i) > 0$. 

The different terms in Eq. (6.15) have been given special names: 

• $P(A_i)$: prior probability of $A_i$, 

• $P(B|A_i)$: likelihood of $B$, given $A_i$, and 

• $P(A_i|B)$: posterior probability of $A_i$, given $B$. 

On the backdrop of some random event $B$, Bayes' theorem thus foremost relates the posterior 
probability $P(A_i|B)$ of a particular random event $A_i$ to its prior probability $P(A_i)$. This result 
forms the basis of the frequently encountered empirical practice of updating one's prior "subjective 
probability" of a specific random event by means of adequate experimental or observational data 
(and corresponding theoretical considerations); see, e.g., Sivia and Skilling (2006) [47, p 5ff]. In 
particular, Bayesian statistics forms a cornerstone in the mathematical modelling of economic 
agents' choice behaviour under conditions of uncertainty; cf. the brief review by Svetlova and van 
Elst (2012) [54], and references therein. 
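The interplay of the law of total probability and Bayes' theorem can be illustrated with a two-event partition. In the sketch below all numerical values (a prior of 1/100, likelihoods of 95/100 and 5/100) are hypothetical and serve only to show the mechanics of updating a prior probability to a posterior probability.

```python
from fractions import Fraction

# Hypothetical partition of the sample space: A_1 = "present", A_2 = "absent".
prior = {"present": Fraction(1, 100), "absent": Fraction(99, 100)}

# Hypothetical likelihoods P(B | A_i), where B = "test result is positive".
likelihood = {"present": Fraction(95, 100), "absent": Fraction(5, 100)}

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i).
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B).
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}

print(p_b)                   # 59/1000
print(posterior["present"])  # 19/118
```

Although the likelihood of a positive result is high when the condition is present, the small prior keeps the posterior probability at about 0.16 in this made-up example.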
Chapter 7 

Discrete and continuous random variables 

Applications of inferential statistical methods (to be discussed in Chs. 11 and 12 below) rest 
fundamentally on the concept of a probability-dependent quantity arising in the context of random 
experiments which is referred to as a random variable. The present chapter aims to provide a 
basic introduction to the general properties and characteristic features of these quantities. We begin 
by stating their definition. 

Def.: A real-valued random variable is defined by a one-to-one mapping 

$$ X: \Omega \rightarrow W \subseteq \mathbb{R} \tag{7.1} $$

of the sample space $\Omega$ of some random experiment into a subset $W$ of the real numbers $\mathbb{R}$. 

Depending on the nature of the spectrum of values of $X$, we will distinguish in the following 
between random variables of the discrete and of the continuous kind. 

7.1 Discrete random variables 

Discrete random variables are signified by the existence of a finite or countably infinite 
Spectrum of values: 

$$ X \mapsto x \in \{x_1, \ldots, x_n\} \subseteq \mathbb{R} \,, \quad \text{with } n \in \mathbb{N} \,. \tag{7.2} $$

All values $x_i$ ($i = 1, \ldots, n$) in this spectrum, referred to as possible realisations of $X$, are assigned 
individual probabilities $p_i$ by a 

Probability function: 

$$ P(X = x_i) = p_i \quad \text{for } i = 1, \ldots, n \,, \tag{7.3} $$

with properties 

(i) $0 \leq p_i \leq 1$, and (non-negativity) 

(ii) $\sum_{i=1}^{n} p_i = 1$. (normalisability) 

Specific distributional features of a discrete random variable $X$ deriving from $P(X = x_i)$ are 
encoded in the associated theoretical 

Cumulative distribution function (cdf): 

$$ F_X(x) = \text{cdf}(x) := P(X \leq x) = \sum_{i \,|\, x_i \leq x} P(X = x_i) \,. \tag{7.4} $$

The cdf exhibits the asymptotic behaviour 

$$ \lim_{x \to -\infty} F_X(x) = 0 \,, \qquad \lim_{x \to +\infty} F_X(x) = 1 \,. \tag{7.5} $$

Information on the central tendency and the variability of a discrete $X$ resides in its 
Expectation value and variance: 

$$ E(X) := \sum_{i=1}^{n} x_i P(X = x_i) \tag{7.6} $$

$$ \text{Var}(X) := \sum_{i=1}^{n} (x_i - E(X))^2 P(X = x_i) \,. \tag{7.7} $$

By the so-called shift theorem it holds that the variance may alternatively be obtained from the 
computationally more efficient formula 

$$ \text{Var}(X) = E\left[(X - E(X))^2\right] = E(X^2) - [E(X)]^2 \,. \tag{7.8} $$

Specific values of $E(X)$ and $\text{Var}(X)$ will be denoted throughout by the Greek letters $\mu$ and $\sigma^2$, 
respectively. 

The evaluation of event probabilities for a discrete random variable $X$ follows from the 
Computational rules: 

\begin{align}
P(X \leq d) &= F_X(d) \tag{7.9} \\
P(X < d) &= F_X(d) - P(X = d) \tag{7.10} \\
P(X \geq c) &= 1 - F_X(c) + P(X = c) \tag{7.11} \\
P(X > c) &= 1 - F_X(c) \tag{7.12} \\
P(c \leq X \leq d) &= F_X(d) - F_X(c) + P(X = c) \tag{7.13} \\
P(c < X \leq d) &= F_X(d) - F_X(c) \tag{7.14} \\
P(c \leq X < d) &= F_X(d) - F_X(c) - P(X = d) + P(X = c) \tag{7.15} \\
P(c < X < d) &= F_X(d) - F_X(c) - P(X = d) \,, \tag{7.16}
\end{align}

where $c$ and $d$ denote arbitrary lower and upper cut-off values imposed on the spectrum of $X$. 

In applications it is frequently of interest to know the values of a discrete cdf's 
$\alpha$-quantiles: 

These are realisations $x_\alpha$ of $X$ specifically determined by the condition that $X$ take values $x \leq x_\alpha$ 
at least with probability $\alpha$ (for $0 < \alpha < 1$), i.e., 

$$ F_X(x_\alpha) = P(X \leq x_\alpha) \geq \alpha \quad \text{and} \quad F_X(x) = P(X \leq x) < \alpha \ \text{ for } x < x_\alpha \,. \tag{7.17} $$
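The quantities defined for a discrete random variable can be computed directly from its probability function. The sketch below takes as an assumed example the number of heads in two fair coin flips and evaluates the cdf, the expectation value, the variance (both by definition and via the shift theorem), one of the computational rules, and an $\alpha$-quantile in the sense of Eq. (7.17). The function names are illustrative choices.

```python
from fractions import Fraction

# Assumed example: X = number of heads in two fair coin flips.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def cdf(x):
    """F_X(x) = P(X <= x): sum of probabilities of realisations not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))


mean = sum(value * p for value, p in pmf.items())
variance = sum((value - mean) ** 2 * p for value, p in pmf.items())
# Shift theorem: Var(X) = E(X^2) - [E(X)]^2.
variance_shift = sum(value ** 2 * p for value, p in pmf.items()) - mean ** 2


def quantile(alpha):
    """Smallest realisation x with F_X(x) >= alpha."""
    return min(value for value in pmf if cdf(value) >= alpha)


# P(c < X < d) = F_X(d) - F_X(c) - P(X = d), here with c = 0 and d = 2.
c, d = 0, 2
p_open_interval = cdf(d) - cdf(c) - pmf[d]

print(mean, variance, variance_shift, p_open_interval, quantile(Fraction(1, 2)))
```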



7.2 Continuous random variables 
Continuous random variables possess an uncountably infinite 
Spectrum of values: 

$$ X \mapsto x \in D \subseteq \mathbb{R} \,. \tag{7.18} $$

It is therefore no longer meaningful to assign probabilities to individual realisations $x$ of $X$, but 
only to infinitesimally small intervals $dx \in D$ by means of a 

Probability density function (pdf): 

$$ f_X(x) = \text{pdf}(x) \,. \tag{7.19} $$

Hence, approximately, $P(X \in dx) \approx f_X(\xi) \, dx$, for some $\xi \in dx$. The pdf of an arbitrary 
continuous $X$ has the defining properties: 

(i) $f_X(x) \geq 0$ for all $x \in D$, (non-negativity) 

(ii) $\int_{-\infty}^{+\infty} f_X(x) \, dx = 1$, and (normalisability) 

(iii) $f_X(x) = F'_X(x)$. (link to cdf) 

The evaluation of event probabilities for a continuous $X$ rests on the associated theoretical 

Cumulative distribution function (cdf): 

$$ F_X(x) = \text{cdf}(x) := P(X \leq x) = \int_{-\infty}^{x} f_X(t) \, dt \,. \tag{7.20} $$

Event probabilities are to be obtained according to the 
Computational rules: 

\begin{align}
P(X = d) &= 0 \tag{7.21} \\
P(X \leq d) &= F_X(d) \tag{7.22} \\
P(X \geq c) &= 1 - F_X(c) \tag{7.23} \\
P(c \leq X \leq d) &= F_X(d) - F_X(c) \,, \tag{7.24}
\end{align}

where $c$ and $d$ denote arbitrary lower and upper cut-off values imposed on the spectrum of $X$. Note 
that, again, the cdf exhibits the asymptotic properties 

$$ \lim_{x \to -\infty} F_X(x) = 0 \,, \qquad \lim_{x \to +\infty} F_X(x) = 1 \,. \tag{7.25} $$

The central tendency and the variability of a continuous $X$ are quantified by its 
Expectation value and variance: 

$$ E(X) := \int_{-\infty}^{+\infty} x f_X(x) \, dx \tag{7.26} $$

$$ \text{Var}(X) := \int_{-\infty}^{+\infty} (x - E(X))^2 f_X(x) \, dx \,. \tag{7.27} $$

Again, by the shift theorem the variance may alternatively be obtained from the computationally 
more efficient formula $\text{Var}(X) = E\left[(X - E(X))^2\right] = E(X^2) - [E(X)]^2$. Specific values of $E(X)$ 
and $\text{Var}(X)$ will be denoted throughout by $\mu$ and $\sigma^2$, respectively. 

The construction of interval estimates for unknown distribution parameters of given populations, 
and the statistical testing of hypotheses (to be discussed later in Chs. 11 and 12), require explicit 
knowledge of the $\alpha$-quantiles associated with the cdfs of particular continuous random variables 
$X$. In the present case, these are defined as follows. 

$\alpha$-quantiles: 

$X$ take values $x \leq x_\alpha$ with probability $\alpha$ (for $0 < \alpha < 1$), i.e., 

$$ P(X \leq x_\alpha) = F_X(x_\alpha) = \alpha \quad \overset{F_X(x) \text{ is strictly monotonically increasing}}{\Longleftrightarrow} \quad x_\alpha = F_X^{-1}(\alpha) \,. \tag{7.28} $$

Hence, $\alpha$-quantiles of the distributions of continuous $X$ are determined by the inverse cdf $F_X^{-1}$. 
For given $\alpha$, the spectrum of $X$ is thus naturally partitioned into domains $x \leq x_\alpha$ and $x \geq x_\alpha$. 

7.3 Lorenz curve for continuous random variables 

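For the distribution of a continuous random variable $X$, the associated Lorenz curve expressing 
qualitatively the degree of concentration involved is defined by 

$$ L(x_\alpha) = \frac{\int_{-\infty}^{x_\alpha} t f_X(t) \, dt}{\int_{-\infty}^{+\infty} t f_X(t) \, dt} \,, \tag{7.29} $$

with $x_\alpha$ denoting a particular $\alpha$-quantile of the distribution in question. 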

7.4 Linear transformations of random variables 

Linear transformations of real-valued random variables $X$ are determined by the two-parameter 
relation 

$$ Y = a + bX \quad \text{with } a, b \in \mathbb{R}, \ b \neq 0 \,, \tag{7.30} $$



where Y denotes the resultant new random variable. Transformations of random variables of this 
kind have the following effects on the computation of expectation values and variances. 



7.4.1 Effect on expectation values 

1. $E(a) = a$ 

2. $E(bX) = b E(X)$ 

3. $E(Y) = E(a + bX) = E(a) + E(bX) = a + b E(X)$. 

7.4.2 Effect on variances 

1. $\text{Var}(a) = 0$ 

2. $\text{Var}(bX) = b^2 \text{Var}(X)$ 

3. $\text{Var}(Y) = \text{Var}(a + bX) = \text{Var}(a) + \text{Var}(bX) = b^2 \text{Var}(X)$. 



7.4.3 Standardisation 



Standardisation of an arbitrary random variable $X$, with $\sqrt{\text{Var}(X)} > 0$, implies the 
determination of a special linear transformation $X \mapsto Z$ according to Eq. (7.30) such that the 
expectation value and variance of $X$ are re-scaled to their simplest values possible, i.e., $E(Z) = 0$ and 
$\text{Var}(Z) = 1$. Hence, the two (in part non-linear) conditions 

$$ 0 = E(Z) = a + b E(X) \quad \text{and} \quad 1 = \text{Var}(Z) = b^2 \text{Var}(X) \,, $$

for unknowns $a$ and $b$, need to be satisfied simultaneously. These are solved by, respectively, 

$$ a = -\frac{E(X)}{\sqrt{\text{Var}(X)}} \quad \text{and} \quad b = \frac{1}{\sqrt{\text{Var}(X)}} \,, \tag{7.31} $$

and so 

$$ X \rightarrow Z = \frac{X - E(X)}{\sqrt{\text{Var}(X)}} \,, \qquad x \mapsto z = \frac{x - \mu}{\sigma} \,, \tag{7.32} $$

irrespective of whether the random variable $X$ is of the discrete kind (cf. Sec. 7.1) or of the 
continuous kind (cf. Sec. 7.2). It is essential for applications to realise that under the process 
of standardisation the values of event probabilities for a random variable $X$ remain invariant 
(unchanged), i.e., 

$$ P(X \leq x) = P\left(\frac{X - E(X)}{\sqrt{\text{Var}(X)}} \leq \frac{x - \mu}{\sigma}\right) = P(Z \leq z) \,. \tag{7.33} $$
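A short numerical check may help to make the effect of standardisation concrete. The sketch below standardises the outcome of a fair six-sided die (an assumed example) and confirms that the resulting variable has expectation value 0 and variance 1, and that an event probability of the form $P(X \leq x)$ is unchanged.

```python
from math import isclose, sqrt

# Assumed example: X = face value of a fair six-sided die.
pmf_x = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in pmf_x.items())
sigma = sqrt(sum((x - mu) ** 2 * p for x, p in pmf_x.items()))


def standardise(x):
    """Map a realisation x of X to z = (x - mu) / sigma."""
    return (x - mu) / sigma


# Probability function of Z: same probabilities, re-scaled realisations.
pmf_z = {standardise(x): p for x, p in pmf_x.items()}

mean_z = sum(z * p for z, p in pmf_z.items())
var_z = sum((z - mean_z) ** 2 * p for z, p in pmf_z.items())

# Event probabilities are invariant: P(X <= 4) = P(Z <= z(4)).
p_x = sum(p for x, p in pmf_x.items() if x <= 4)
p_z = sum(p for z, p in pmf_z.items() if z <= standardise(4))

print(round(mean_z, 12), round(var_z, 12), p_x, p_z)
```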



7.5 Sums of random variables and reproductivity 



Def.: For a set of $n$ additive random variables $X_1, \ldots, X_n$, one defines a total sum random variable 
$Y_n$ and a mean random variable $\bar{X}_n$ according to 

$$ Y_n := \sum_{i=1}^{n} X_i \quad \text{and} \quad \bar{X}_n := \frac{1}{n} Y_n \,. \tag{7.34} $$

By linearity of the expectation value operation,^1 it then holds that 

$$ E(Y_n) = E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i) \quad \text{and} \quad E(\bar{X}_n) = \frac{1}{n} E(Y_n) \,. \tag{7.35} $$

1 That is: $E(X_1 + X_2) = E(X_1) + E(X_2)$. 

If, in addition, the $X_1, \ldots, X_n$ are mutually stochastically independent according to Eq. (6.13), it 
follows from Subsec. 7.4.2 that the variances of $Y_n$ and $\bar{X}_n$ are given by 

$$ \text{Var}(Y_n) = \text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}(X_i) \quad \text{and} \quad \text{Var}(\bar{X}_n) = \frac{1}{n^2} \text{Var}(Y_n) \,. \tag{7.36} $$

Def.: Reproductivity of a probability distribution law (cdf) $F(x)$ is given when the sum $Y_n$ of $n$ 
independent and identically distributed (in short: "i.i.d.") additive random variables $X_1, \ldots, X_n$, 
which each individually satisfy distribution laws $F_{X_i}(x) = F(x)$, inherits this same distribution 
law $F(x)$ from its underlying $n$ random variables. Examples of reproductive distribution laws, to 
be discussed in the following Ch. 8, are the binomial, the Gaußian normal, and the $\chi^2$-distributions. 



Chapter 8 



Standard distributions of discrete and 
continuous random variables 



In this chapter, we review (i) the probability distribution laws which one typically encounters as 
theoretical distributions in the context of the statistical testing of hypotheses (cf. Chs. 11 and 
12), but we also include (ii) cases of well-established pedagogical merit, and (iii) a few examples 
of rather specialised distribution laws, which, nevertheless, prove to be of interest in the description 
and modelling of various theoretical market situations in Economics. We split our considerations 
into two main parts according to whether a random variable X underlying a particular distribution 
law varies discretely or continuously. For each of the cases selected, we list the spectrum of values 
of X, its probability function (for discrete X) or probability density function (pdf ) (for con- 
tinuous X), its cumulative distribution function (cdf ), its expectation value and its variance, 
and, in some continuous cases, also its a-quantiles. Additional information, e.g. commands by 
which a specific function may be activated for computational purposes or plotted on a GDC, by 
EXCEL, or by R, is included where available. 

8.1 Discrete uniform distribution 

One of the simplest probability distribution laws for a discrete random variable X is given by the 
one-parameter discrete uniform distribution, 



$$ X \sim L(n) \,, \tag{8.1} $$

which is characterised by the number $n$ of different values in $X$'s 
Spectrum of values: 

$$ X \mapsto x \in \{x_1, \ldots, x_n\} \subseteq \mathbb{R} \,, \quad \text{with } n \in \mathbb{N} \,. \tag{8.2} $$

Probability function: 

$$ P(X = x_i) = \frac{1}{n} \quad \text{for } i = 1, \ldots, n \,; \tag{8.3} $$

its graph is shown in Fig. 8.1 below for $n = 6$. 

Figure 8.1: Probability function of the discrete uniform distribution according to Eq. (8.3) for the 
case $L(6)$. 

Cumulative distribution function (cdf): 

$$ F_X(x) = P(X \leq x) = \sum_{i \,|\, x_i \leq x} \frac{1}{n} \,. \tag{8.4} $$

Expectation value and variance: 

$$ E(X) = \sum_{i=1}^{n} x_i \times \frac{1}{n} = \mu \tag{8.5} $$

$$ \text{Var}(X) = \left(\sum_{i=1}^{n} x_i^2 \times \frac{1}{n}\right) - \mu^2 \,. \tag{8.6} $$

The discrete uniform distribution is synonymous with a Laplacian probability measure; cf. 
Sec. 6.3. 

8.2 Binomial distribution 






8.2.1 Bernoulli distribution 



Another simple probability distribution law, for a discrete random variable $X$ with only two 
possible values, 0 and 1, is due to the Swiss mathematician Jacob Bernoulli (1654-1705). The Bernoulli 
distribution, 

$$ X \sim B(1; p) \,, \tag{8.7} $$

depends on a single free parameter, the probability $p \in [0; 1]$ for $X = 1$. 
Spectrum of values: 

$$ X \mapsto x \in \{0, 1\} \,. \tag{8.8} $$

Probability function: 

$$ P(X = x) = \binom{1}{x} p^x (1-p)^{1-x} \,, \quad \text{with } 0 \leq p \leq 1 \,; \tag{8.9} $$

its graph is shown in Fig. 8.2 below for $p = \frac{1}{3}$. 

Cumulative distribution function (cdf): 

$$ F_X(x) = P(X \leq x) = \sum_{k=0}^{\lfloor x \rfloor} \binom{1}{k} p^k (1-p)^{1-k} \,. \tag{8.10} $$

Expectation value and variance: 

$$ E(X) = 0 \times (1-p) + 1 \times p = p \tag{8.11} $$

$$ \text{Var}(X) = 0^2 \times (1-p) + 1^2 \times p - p^2 = p(1-p) \,. \tag{8.12} $$

8.2.2 General binomial distribution 

A direct generalisation of the Bernoulli distribution is the case of a discrete random variable 
$X$ which is the sum of $n$ mutually stochastically independent, identically Bernoulli-distributed 
("i.i.d.") random variables $X_i \sim B(1; p)$ ($i = 1, \ldots, n$), i.e., 

$$ X := \sum_{i=1}^{n} X_i = X_1 + \ldots + X_n \,, \tag{8.13} $$

which yields the reproductive two-parameter binomial distribution 

$$ X \sim B(n; p) \,, \tag{8.14} $$

again with $p \in [0; 1]$ the probability for a single $X_i = 1$. 

Figure 8.2: Probability function of the Bernoulli distribution according to Eq. (8.9) for the case 
$B(1; \frac{1}{3})$. 

Spectrum of values: 

$$ X \mapsto x \in \{0, \ldots, n\} \,, \quad \text{with } n \in \mathbb{N} \,. \tag{8.15} $$

Probability function:^1 

$$ P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \,, \quad \text{with } 0 \leq p \leq 1 \,; \tag{8.16} $$

its graph is shown in Fig. 8.3 below for $n = 10$ and $p = \frac{3}{5}$. 

1 In the context of an urn model with $M$ black balls and $N - M$ white balls, and the random selection of $n$ 
balls from a total of $N$, with repetition, this probability function can be derived from Laplace's principle of forming 
the ratio between the "number of favourable cases" and the "number of all possible cases", cf. Eq. (6.8). Thus, 
$P(X = x) = \dfrac{\binom{n}{x} M^x (N-M)^{n-x}}{N^n}$, where $x$ denotes the number of black balls drawn, and one substitutes 
accordingly from the definition $p := M/N$. 

Figure 8.3: Probability function of the binomial distribution according to Eq. (8.16) for the case 
$B(10; \frac{3}{5})$. 

Cumulative distribution function (cdf): 

$$ F_X(x) = P(X \leq x) = \sum_{k=0}^{\lfloor x \rfloor} \binom{n}{k} p^k (1-p)^{n-k} \,. \tag{8.17} $$

Expectation value and variance: 

$$ E(X) = \sum_{i=1}^{n} p = np \tag{8.18} $$

$$ \text{Var}(X) = \sum_{i=1}^{n} p(1-p) = np(1-p) \,. \tag{8.19} $$

The results for $E(X)$ and $\text{Var}(X)$ are based on the rules (7.35) and (7.36), the latter of which 
applies to a set of mutually stochastically independent random variables. 

GDC: binompdf(n, p, x), binomcdf(n, p, x) 
EXCEL: BINOMDIST (dt.: BINOM.VERT, BINOMVERT, BINOM.INV) 
R: dbinom(x, n, p), pbinom(x, n, p), qbinom(x, n, p) 
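For readers without access to the tools listed above, the binomial probability function and cdf are easily evaluated from their definitions. The following is a minimal sketch using only the Python standard library; the function names `binom_pmf` and `binom_cdf` are illustrative choices, and the parameter values $n = 10$, $p = 0.6$ correspond to the case $B(10; \frac{3}{5})$.

```python
from math import comb, floor, isclose


def binom_pmf(x, n, p):
    """P(X = x) for X ~ B(n; p), evaluated from the binomial coefficient."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binom_cdf(x, n, p):
    """P(X <= x): sum of the probability function up to floor(x)."""
    return sum(binom_pmf(k, n, p) for k in range(0, floor(x) + 1))


n, p = 10, 0.6
pmf = {x: binom_pmf(x, n, p) for x in range(n + 1)}

mean = sum(x * q for x, q in pmf.items())
variance = sum((x - mean) ** 2 * q for x, q in pmf.items())

print(round(mean, 6), round(variance, 6))   # compare with n*p and n*p*(1-p)
print(round(binom_cdf(6, n, p), 4))
```

The computed mean and variance agree with $np = 6$ and $np(1-p) = 2.4$ up to floating-point rounding.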
8.3 Hypergeometric distribution 

The hypergeometric distribution for a discrete random variable $X$ derives from an urn model 
with $M$ black balls and $N - M$ white balls, and the random selection of $n$ balls from a total of 
$N$ ($n \leq N$), without repetition. If $X$ represents the number of black balls amongst the $n$ selected 
balls, it is subject to the three-parameter probability distribution 

$$ X \sim H(n, M, N) \,. \tag{8.20} $$

In particular, this model forms the mathematical basis of the internationally popular National 
Lotteries "6 out of 49", in which case there are $M = 6$ winning numbers amongst the total of $N = 49$ 
numbers, and $X \in [0; 6]$ denotes the number of correct guesses marked on an individual player's 
lottery ticket. 

Spectrum of values: 

$$ X \mapsto x \in \{\max(0, n - (N - M)), \ldots, \min(n, M)\} \,. \tag{8.21} $$

Probability function: 

$$ P(X = x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}} \,. \tag{8.22} $$

Cumulative distribution function (cdf): 

$$ F_X(x) = P(X \leq x) = \sum_{k=\max(0, n-(N-M))}^{\lfloor x \rfloor} \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}} \,. \tag{8.23} $$

Expectation value and variance: 

$$ E(X) = n \frac{M}{N} \tag{8.24} $$

$$ \text{Var}(X) = n \frac{M}{N} \left(1 - \frac{M}{N}\right) \left(\frac{N-n}{N-1}\right) \,. \tag{8.25} $$

EXCEL: HYPGEOMDIST (dt.: HYPGEOM.VERT, HYPGEOMVERT) 
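The lottery "6 out of 49" mentioned above makes for a compact worked example of the hypergeometric distribution with $n = 6$, $M = 6$ and $N = 49$. The sketch below evaluates the standard hypergeometric probability function in exact rational arithmetic and compares the mean and variance obtained from it with the standard closed-form expressions $nM/N$ and $n \frac{M}{N}(1 - \frac{M}{N})\frac{N-n}{N-1}$; the function name `hypergeom_pmf` is an illustrative choice.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x, n, M, N):
    """P(X = x): x black balls among n drawn without repetition from N balls, M of them black."""
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))


n, M, N = 6, 6, 49
spectrum = range(max(0, n - (N - M)), min(n, M) + 1)
pmf = {x: hypergeom_pmf(x, n, M, N) for x in spectrum}

p_six_correct = pmf[6]
p_at_least_three = sum(pmf[x] for x in spectrum if x >= 3)

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(p_six_correct)               # 1/13983816
print(float(p_at_least_three))
print(mean, variance)
```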



8.4 Continuous uniform distribution 



The simplest example of a probability distribution for a continuous random variable $X$ is the 
continuous uniform distribution, 

$$ X \sim Rec(a; b) \,, \tag{8.26} $$

also referred to as the rectangular distribution. Its two free parameters $a$ and $b$ denote the limits 
of $X$'s 

Spectrum of values: 

$$ X \mapsto x \in [a, b] \subset \mathbb{R} \,. \tag{8.27} $$

Probability density function (pdf):^2 

$$ f_X(x) = \begin{cases} \dfrac{1}{b-a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases} \,; \tag{8.28} $$

its graph is shown in Fig. 8.4 below for three different combinations of the parameters $a$ and $b$. 

Figure 8.4: pdf of the continuous uniform distribution according to Eq. (8.28) for the cases 
$Rec(0; 5)$, $Rec(1; 4)$ and $Rec(2; 3)$. 

2 It is a nice and instructive little exercise, strongly recommended to the reader, to go through the details of explicitly 
computing from this simple pdf the corresponding cdf and the expectation value and variance of $X \sim Rec(a; b)$. 

Cumulative distribution function (cdf): 

$$ F_X(x) = P(X \leq x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{x-a}{b-a} & \text{for } x \in [a, b] \\ 1 & \text{for } x > b \end{cases} \,. \tag{8.29} $$

Expectation value and variance: 

$$ E(X) = \frac{a+b}{2} \tag{8.30} $$

$$ \text{Var}(X) = \frac{(b-a)^2}{12} \,. \tag{8.31} $$



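Note that with Eq. (8.29) it is thus a general result that for all continuous uniform distributions 

$$ P\left(|X - E(X)| \leq \sqrt{\text{Var}(X)}\right) = P\left(\frac{\sqrt{3}(a+b) - (b-a)}{2\sqrt{3}} \leq X \leq \frac{\sqrt{3}(a+b) + (b-a)}{2\sqrt{3}}\right) = \frac{1}{\sqrt{3}} \approx 0.5773 \,, \tag{8.32} $$

i.e., the probability that $X$ falls within one standard deviation ("$1\sigma$") of $E(X)$ is $1/\sqrt{3}$. 
$\alpha$-quantiles of continuous uniform distributions are obtained by straightforward inversion, i.e., for 
$0 < \alpha < 1$, 

$$ \alpha = F_X(x_\alpha) \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = a + \alpha(b-a) \,. \tag{8.33} $$

R: dunif(x, a, b), punif(x, a, b), qunif(x, a, b) 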
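The result that a continuous uniform random variable falls within one standard deviation of its expectation value with probability $1/\sqrt{3}$, independently of $a$ and $b$, can be confirmed numerically. The sketch below implements the piecewise cdf and the quantile formula directly; the function names and the parameter values $a = 2$, $b = 5$ are assumptions for illustration.

```python
from math import isclose, sqrt


def unif_cdf(x, a, b):
    """Piecewise cdf of the continuous uniform distribution on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def unif_quantile(alpha, a, b):
    """alpha-quantile obtained by inverting the cdf on [a, b]."""
    return a + alpha * (b - a)


a, b = 2.0, 5.0                      # assumed parameter values
mean = (a + b) / 2
std_dev = (b - a) / sqrt(12)

p_one_sigma = unif_cdf(mean + std_dev, a, b) - unif_cdf(mean - std_dev, a, b)
median = unif_quantile(0.5, a, b)

print(round(p_one_sigma, 4), round(1 / sqrt(3), 4), median)
```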
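Standardisation according to Eq. (7.32) yields 

$$ Z = \sqrt{3} \, \frac{2X - (a+b)}{b-a} \,, \tag{8.34} $$

with pdf 

$$ f_Z(z) = \begin{cases} \dfrac{1}{2\sqrt{3}} & \text{for } z \in [-\sqrt{3}, \sqrt{3}] \\ 0 & \text{otherwise} \end{cases} \,, \tag{8.35} $$

and cdf 

$$ F_Z(z) = P(Z \leq z) = \begin{cases} 0 & \text{for } z < -\sqrt{3} \\ \dfrac{z + \sqrt{3}}{2\sqrt{3}} & \text{for } z \in [-\sqrt{3}, \sqrt{3}] \\ 1 & \text{for } z > \sqrt{3} \end{cases} \,. \tag{8.36} $$

8.5 Gaußian normal distribution 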

The most prominent continuous probability distribution, which is ubiquitous in Inferential 
Statistics (cf. Chs. 11 and 12), is due to Carl Friedrich Gauß (1777-1855): the reproductive 
two-parameter normal distribution 

$$ X \sim N(\mu; \sigma^2) \,. \tag{8.37} $$

The meaning of the parameters $\mu$ and $\sigma^2$ will be explained shortly. 
Spectrum of values: 

$$ X \mapsto x \in D \subseteq \mathbb{R} \,. \tag{8.38} $$

Probability density function (pdf): 

$$ f_X(x) = \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left[-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2\right] \,. \tag{8.39} $$

This normal-pdf defines a reflection-symmetric characteristic bell-shaped curve, the 
analytical properties of which were first discussed by the French mathematician 
Abraham de Moivre (1667-1754). The $x$-position of this curve's (global) maximum is specified 
by $\mu$, while the $x$-positions of its two points of inflection are determined by $\mu - \sigma$ resp. $\mu + \sigma$. 
The effects of different values of the parameters $\mu$ and $\sigma$ on the bell-shaped curve are illustrated 
in Fig. 8.5 below. 

Figure 8.5: pdf of the Gaußian normal distribution according to Eq. (8.39). Left panel: cases 
$N(-2; 1/4)$, $N(0; 1/4)$ and $N(1; 1/4)$. Right panel: cases $N(0; 1/4)$, $N(0; 1)$ and $N(0; 4)$. 

Cumulative distribution function (cdf): 

$$ F_X(x) = P(X \leq x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left[-\frac{1}{2} \left(\frac{t - \mu}{\sigma}\right)^2\right] dt \,. \tag{8.40} $$

The normal-cdf cannot be expressed in terms of elementary mathematical functions. 
Expectation value and variance: 

$$ E(X) = \mu \tag{8.41} $$

$$ \text{Var}(X) = \sigma^2 \,. \tag{8.42} $$

GDC: normalpdf(x, μ, σ), normalcdf(−∞, x, μ, σ) 
EXCEL: NORMDIST (dt.: NORM.VERT, NORMVERT) 
R: dnorm(x, μ, σ), pnorm(x, μ, σ) 

Upon standardisation of $X$ according to Eq. (7.32), a given normal distribution is transformed into 
the unique standard normal distribution, $Z \sim N(0; 1)$, with 

Probability density function (pdf): 

$$ \varphi(z) := \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2} z^2\right] \quad \text{for } z \in \mathbb{R} \,; \tag{8.43} $$

its graph is shown in Fig. 8.6 below. 

Figure 8.6: pdf of the standard normal distribution according to Eq. (8.43). 

Cumulative distribution function (cdf): 

$$ \Phi(z) := P(Z \leq z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2} t^2\right] dt \,. \tag{8.44} $$

8.6. x 2 -DISTRIBUTION 
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The resultant random variable Z ~ iV(0; 1) satisfies the 
Computational rules: 



P(Z < b) = $(6) (8.45) 

P(Z >a) = 1 - $(a) (8.46) 

P(a<Z<b) = $(6) - $(a) (8.47) 

= l-$(z) (8.48) 

P(-z < Z < z) = 2$(z) - 1 . (8.49) 



The probability that a (standard) normally distributed random variable takes values inside an in- 
terval of length k times two standard deviations centred on its expectation value is given by the 
important fccr-rule according to which 

Eq. £732} Eq. (&A9\ 

P{\X -y\< ha) ^ P{-k <Z<+k) ^> 2$(fc) - 1 for k > . (8.50) 

$\alpha$-quantiles associated with $Z \sim N(0; 1)$ are obtained from the inverse standard normal cdf
according to

\[ \alpha = P(Z \leq z_\alpha) = \Phi(z_\alpha) \quad \Leftrightarrow \quad z_\alpha = \Phi^{-1}(\alpha) \quad \text{for all } 0 < \alpha < 1 . \tag{8.51} \]

Due to the reflection symmetry of $\varphi(z)$ with respect to the vertical axis at $z = 0$, it holds that

\[ z_\alpha = -z_{1-\alpha} . \tag{8.52} \]

For this reason, one typically finds $z_\alpha$ listed in textbooks on Statistics only for $\alpha \in [1/2, 1)$.
Alternatively, a particular $z_\alpha$ may be obtained from a GDC, EXCEL, or R. The backward transformation
from a particular $z_\alpha$ of the standard normal distribution to the corresponding $x_\alpha$ of a given normal
distribution follows from Eq. (7.32) and amounts to $x_\alpha = \mu + z_\alpha \sigma$.

GDC: invNorm(α)

EXCEL: NORMSINV (dt.: NORM.S.INV, NORMINV)
R: qnorm(α)
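As an illustration of Eqs. (8.51) and (8.52) and of the backward transformation $x_\alpha = \mu + z_\alpha \sigma$, here is a short Python sketch. The function name `normal_quantile` and the example values $\mu = 100$, $\sigma = 15$ are assumptions made for this illustration only; `NormalDist.inv_cdf` is part of Python's standard library.

```python
from statistics import NormalDist


def normal_quantile(alpha: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Return x_alpha = mu + z_alpha * sigma, with z_alpha = Phi^{-1}(alpha)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(alpha)  # quantile of the standard normal distribution
    return mu + z_alpha * sigma


z_upper = normal_quantile(0.975)             # z_{0.975}
z_lower = normal_quantile(0.025)             # z_{0.025} = -z_{0.975} by symmetry
x_upper = normal_quantile(0.975, 100.0, 15.0)  # back-transformed to N(100; 15^2)
print(z_upper, z_lower, x_upper)
```

Because of the symmetry in Eq. (8.52), only quantiles for $\alpha \geq 1/2$ need to be computed or tabulated; the remaining ones follow by a change of sign.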

The (standard) normal distribution, as well as the next three examples of continuous probability 
distributions, are commonly referred to as the test distributions, due to the central roles they play 
in the statistical testing of hypotheses (cf. Chs. 11 and 12).



8.6 $\chi^2$-distribution with n degrees of freedom

The reproductive one-parameter $\chi^2$-distribution with $n$ degrees of freedom was devised by the
English mathematical statistician Karl Pearson FRS (1857-1936); cf. Pearson (1900) [40]. The
underlying continuous random variable

\[ X \sim \chi^2(n) \tag{8.53} \]

is perceived as the sum of squares of $n$ stochastically independent, identically standard normally
distributed ("i.i.d.") random variables $Z_i \sim N(0; 1)$ ($i = 1, \ldots, n$), i.e.,

\[ X := \sum_{i=1}^{n} Z_i^2 = Z_1^2 + \ldots + Z_n^2 , \quad \text{with } n \in \mathbb{N} . \tag{8.54} \]

Spectrum of values:

\[ X \mapsto x \in D \subseteq \mathbb{R}_{\geq 0} . \tag{8.55} \]

The probability density function (pdf) of a $\chi^2$-distribution with $df = n$ degrees of freedom
is a fairly complicated mathematical expression; see Rinne (2008) [45, p 319] for the explicit
representation of the $\chi^2$pdf. Plots are shown for four different values of the parameter $n$ in
Fig. 8.7. The $\chi^2$cdf cannot be expressed in terms of elementary mathematical functions.




Figure 8.7: pdf of the $\chi^2$-distribution for $df = n \in \{3, 5, 10, 30\}$ degrees of freedom. The curves
with the highest and lowest peaks correspond to the cases $\chi^2(3)$ and $\chi^2(30)$, respectively.

Expectation value and variance:

\[ \mathrm{E}(X) = n \tag{8.56} \]
\[ \mathrm{Var}(X) = 2n . \tag{8.57} \]

$\alpha$-quantiles, $\chi^2_{n;\alpha}$, of $\chi^2$-distributions are generally tabulated in textbooks on Statistics.
Alternatively, they may be obtained from EXCEL or R.

Note that for $n \geq 50$ a $\chi^2(n)$-distribution may be approximated reasonably well by a normal
distribution $N(n; 2n)$. This is a reflection of the central limit theorem, to be discussed in Sec. 8.13
below.

GDC: χ²pdf(x, n), χ²cdf(0, x, n)

EXCEL: CHIDIST, CHIINV (dt.: CHIQU.VERT, CHIVERT, CHIQU.INV, CHIINV)
R: dchisq(t, n), pchisq(t, n), qchisq(t, n)
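The defining construction of Eq. (8.54) lends itself to a quick Monte Carlo check of Eqs. (8.56) and (8.57). The following Python sketch is an illustration under stated assumptions: the function name `simulate_chi_square`, the choice $n = 5$, the number of draws and the random seed are all arbitrary and not taken from these notes.

```python
import random
from statistics import fmean, variance


def simulate_chi_square(n: int, draws: int, seed: int = 0) -> list:
    """Draw realisations of X = Z_1^2 + ... + Z_n^2 with i.i.d. Z_i ~ N(0; 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(draws)]


n = 5
samples = simulate_chi_square(n=n, draws=20000)

# The empirical moments should lie close to E(X) = n and Var(X) = 2n.
print("empirical mean:    ", fmean(samples), " theory:", n)
print("empirical variance:", variance(samples), " theory:", 2 * n)
```

With a finite number of draws the empirical values only approximate the theoretical ones; the agreement improves as the number of draws grows.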



8.7 t-distribution with n degrees of freedom 

The non-reproductive one-parameter t-distribution with $n$ degrees of freedom was discovered
by the English statistician William Sealy Gosset (1876-1937). Somewhat to the irritation of the
scientific community, he published his findings under the pseudonym of "Student"; cf. Student (1908) [52].
Consider two stochastically independent random variables, $Z \sim N(0; 1)$ and $X \sim \chi^2(n)$, satisfying
the indicated distribution laws. Then the quotient random variable defined by

\[ T := \frac{Z}{\sqrt{X/n}} \sim t(n) , \quad \text{with } n \in \mathbb{N} , \tag{8.58} \]

is t-distributed with $df = n$ degrees of freedom.

Spectrum of values:

\[ T \mapsto t \in D \subseteq \mathbb{R} . \tag{8.59} \]

The probability density function (pdf) of a t-distribution, which exhibits a reflection symmetry
with respect to the vertical axis at $t = 0$, is a fairly complicated mathematical expression; see Rinne
(2008) [45, p 326] for the explicit representation of the tpdf. Plots are shown for four different
values of the parameter $n$ in Fig. 8.8. The tcdf cannot be expressed in terms of elementary
mathematical functions.

Expectation value and variance:

\[ \mathrm{E}(X) = 0 \tag{8.60} \]
\[ \mathrm{Var}(X) = \frac{n}{n-2} \quad \text{for } n > 2 . \tag{8.61} \]

$\alpha$-quantiles, $t_{n;\alpha}$, of t-distributions, for which, due to the reflection symmetry of the tpdf, the
identity $t_{n;\alpha} = -t_{n;1-\alpha}$ holds, are generally tabulated in textbooks on Statistics. Alternatively,
they may be obtained from some GDCs, EXCEL, or R.

Note that for $n \geq 50$ a $t(n)$-distribution may be approximated reasonably well by the standard
normal distribution, $N(0; 1)$. Again, this is a manifestation of the central limit theorem, to be
discussed in Sec. 8.13 below.

GDC: tpdf(t, n), tcdf(−10, t, n), invT(α, n)

EXCEL: TDIST, TINV (dt.: T.VERT, TVERT, T.INV, TINV)

R: dt(t, n), pt(t, n), qt(t, n)
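The quotient construction of a t-distributed random variable from a standard normal variable $Z$ and an independent $\chi^2(n)$ variable $X$, i.e. $T = Z/\sqrt{X/n}$, can be explored by simulation. The Python sketch below is illustrative only: the function name `simulate_t`, the choice $n = 10$, the number of draws and the seed are assumptions made for this example.

```python
import math
import random
from statistics import fmean, variance


def simulate_t(n: int, draws: int, seed: int = 0) -> list:
    """Draw realisations of T = Z / sqrt(X / n), with Z ~ N(0; 1) and X ~ chi^2(n)."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        z = rng.gauss(0.0, 1.0)
        # X is built as a sum of n squared standard normal variables, cf. Eq. (8.54).
        x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        values.append(z / math.sqrt(x / n))
    return values


n = 10
t_values = simulate_t(n=n, draws=20000)

# Compare with E = 0 and Var = n / (n - 2), valid for n > 2.
print("empirical mean:    ", fmean(t_values), " theory:", 0.0)
print("empirical variance:", variance(t_values), " theory:", n / (n - 2))
```

For small $n$ the fat tails of the t-distribution make the empirical variance converge slowly, which is why a moderately large $n$ was chosen here.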
Figure 8.8: pdf of the t-distribution for $df = n \in \{1, 2, 5, 50\}$ degrees of freedom. The curves
with the lowest and highest peaks correspond to the cases $t(1)$ and $t(50)$, respectively. In the latter,
the tpdf is essentially equivalent to the standard normal pdf. Notice the fatter tails of the tpdf
for small values of $n$.

8.8 F-distribution with n_1 and n_2 degrees of freedom

The reproductive two-parameter F-distribution with $n_1$ and $n_2$ degrees of freedom was made
prominent in Statistics by the English statistician, evolutionary biologist, eugenicist and geneticist
Sir Ronald Aylmer Fisher FRS (1890-1962), and the US-American mathematician and statistician
George Waddel Snedecor (1881-1974); cf. Fisher (1924) and Snedecor (1934). Consider
two sets of stochastically independent, identically standard normally distributed ("i.i.d.") random
variables, $X_i \sim N(0; 1)$ ($i = 1, \ldots, n_1$), and $Y_j \sim N(0; 1)$ ($j = 1, \ldots, n_2$). Define the sums

\[ X := \sum_{i=1}^{n_1} X_i^2 \quad \text{and} \quad Y := \sum_{j=1}^{n_2} Y_j^2 , \tag{8.62} \]

each of which satisfies a $\chi^2$-distribution with $n_1$ resp. $n_2$ degrees of freedom. Then the quotient
random variable

\[ F_{n_1,n_2} := \frac{X/n_1}{Y/n_2} \sim F(n_1, n_2) , \quad \text{with } n_1, n_2 \in \mathbb{N} , \tag{8.63} \]

is F-distributed with $df_1 = n_1$ and $df_2 = n_2$ degrees of freedom.
Spectrum of values:

\[ F_{n_1,n_2} \mapsto f_{n_1,n_2} \in D \subseteq \mathbb{R}_{\geq 0} . \tag{8.64} \]

The probability density function (pdf) of an F-distribution is quite a complicated mathematical
expression; see Rinne (2008) [45, p 330] for the explicit representation of the Fpdf. Plots are
shown for three different combinations of the parameters $n_1$ and $n_2$ in Fig. 8.9. The Fcdf cannot
be expressed in terms of elementary mathematical functions.

Figure 8.9: pdf of the F-distribution for three combinations of degrees of freedom ($df_1 = n_1$,
$df_2 = n_2$). The curves correspond to the cases $F(80, 40)$ (highest peak), $F(10, 50)$, and $F(3, 5)$
(lowest peak), respectively.

Expectation value and variance:

\[ \mathrm{E}(X) = \frac{n_2}{n_2 - 2} \quad \text{for } n_2 > 2 \tag{8.65} \]
\[ \mathrm{Var}(X) = \frac{2 n_2^2 (n_1 + n_2 - 2)}{n_1 (n_2 - 2)^2 (n_2 - 4)} \quad \text{for } n_2 > 4 . \tag{8.66} \]

$\alpha$-quantiles, $f_{n_1,n_2;\alpha}$, of F-distributions are tabulated in advanced textbooks on Statistics.
Alternatively, they may be obtained from EXCEL or R.

GDC: Fpdf(x, n_1, n_2), Fcdf(0, x, n_1, n_2)

EXCEL: FDIST, FINV (dt.: F.VERT, FVERT, F.INV, FINV)

R: df(x, n_1, n_2), pf(x, n_1, n_2), qf(t, n_1, n_2)
8.9 Pareto distribution

When studying the distribution of wealth and income of people in many different countries at
the end of the 19th Century, the Italian engineer, sociologist, economist, political scientist and
philosopher Vilfredo Federico Damaso Pareto (1848-1923) discovered certain types of quantitative
regularities which he could model mathematically in terms of a simple power-law function
involving only two free parameters (cf. Pareto (1896) [39]). The random variable $X$ underlying
such a Pareto distribution,

\[ X \sim Par(\gamma, x_{\min}) , \tag{8.67} \]



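has a

Spectrum of values:

\[ X \mapsto x \in \{x \,|\, x \geq x_{\min}\} \subset \mathbb{R}_{>0} , \tag{8.68} \]

and a

Probability density function (pdf):

\[ f_X(x) = \begin{cases} 0 & \text{for } x < x_{\min} \\ \dfrac{\gamma}{x_{\min}} \left(\dfrac{x_{\min}}{x}\right)^{\gamma+1} , \ \gamma \in \mathbb{R}_{>0} & \text{for } x \geq x_{\min} \end{cases} ; \tag{8.69} \]

its graph is shown in Fig. 8.10 below for three different values of the exponent $\gamma$.

Cumulative distribution function (cdf):

\[ F_X(x) = P(X \leq x) = \begin{cases} 0 & \text{for } x < x_{\min} \\ 1 - \left(\dfrac{x_{\min}}{x}\right)^{\gamma} & \text{for } x \geq x_{\min} \end{cases} . \tag{8.70} \]

Expectation value and variance:

\[ \mathrm{E}(X) = \frac{\gamma}{\gamma - 1} \, x_{\min} \quad \text{for } \gamma > 1 \tag{8.71} \]
\[ \mathrm{Var}(X) = \frac{\gamma}{(\gamma - 1)^2 (\gamma - 2)} \, x_{\min}^2 \quad \text{for } \gamma > 2 . \tag{8.72} \]

It is important to realise that for $\gamma \leq 1$ neither E(X) nor Var(X) is well-defined; for $1 < \gamma \leq 2$ it is
only Var(X) which is not well-defined.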



$\alpha$-quantiles:

\[ \alpha = F_X(x_\alpha) = 1 - \left(\frac{x_{\min}}{x_\alpha}\right)^{\gamma} \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = \sqrt[\gamma]{\frac{1}{1 - \alpha}} \, x_{\min} \quad \text{for all } 0 < \alpha < 1 . \tag{8.73} \]

Figure 8.10: pdf of the Pareto distribution according to Eq. (8.69) for $x_{\min} = 1$ and
$\gamma \in \left\{ \frac{1}{2}, \frac{\ln(5)}{\ln(4)}, \frac{5}{2} \right\}$. The curve with the largest value at $x = 1$ corresponds to $Par\left(\frac{5}{2}, 1\right)$.



Note that it follows from Eq. (8.70) that the probability of a Pareto-distributed continuous random
variable $X$ to exceed a certain threshold value $x$ is given by the simple power-law rule

\[ P(X > x) = 1 - P(X \leq x) = \left(\frac{x_{\min}}{x}\right)^{\gamma} . \tag{8.74} \]

Hence, the ratio of probabilities

\[ \frac{P(X > ax)}{P(X > x)} = \frac{\left(\frac{x_{\min}}{ax}\right)^{\gamma}}{\left(\frac{x_{\min}}{x}\right)^{\gamma}} = \left(\frac{1}{a}\right)^{\gamma} , \tag{8.75} \]

with $a \in \mathbb{R}_{>0}$, is scale-invariant, meaning independent of a particular scale $x$ at which one
observes $X$ (cf. Taleb (2007) [55, p 256ff and p 326ff]). This behaviour is a direct consequence
of a special mathematical property of Pareto distributions which is technically referred to as
self-similarity. It can be defined from the fact that a Pareto-pdf (8.69) has constant elasticity, i.e. (cf.
Ref. [?, Sec. 7.6])

\[ \varepsilon_{f_X}(x) = -(\gamma + 1) \quad \text{for } x > x_{\min} . \tag{8.76} \]
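The scale-invariance of the ratio of exceedance probabilities can be checked numerically. In the following Python sketch the function names `pareto_survival` and `pareto_quantile`, as well as the parameter values $\gamma = 1.5$, $x_{\min} = 1$ and $a = 2$, are assumptions chosen purely for illustration.

```python
def pareto_survival(x: float, gamma: float, x_min: float) -> float:
    """Return P(X > x) = (x_min / x)^gamma for X ~ Par(gamma, x_min)."""
    if x < x_min:
        return 1.0
    return (x_min / x) ** gamma


def pareto_quantile(alpha: float, gamma: float, x_min: float) -> float:
    """Return the alpha-quantile x_alpha = x_min / (1 - alpha)^(1 / gamma)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return x_min / (1.0 - alpha) ** (1.0 / gamma)


gamma, x_min, a = 1.5, 1.0, 2.0

# The ratio P(X > a*x) / P(X > x) equals (1/a)^gamma at every scale x >= x_min.
ratios = [
    pareto_survival(a * x, gamma, x_min) / pareto_survival(x, gamma, x_min)
    for x in (1.0, 10.0, 100.0, 1000.0)
]
print(ratios, (1.0 / a) ** gamma)
```

The quantile helper additionally provides a simple way of generating Pareto-distributed pseudo-random numbers: inserting uniformly distributed values from the open interval (0, 1) for $\alpha$ yields realisations of $X$.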

Further interesting examples of distributions of quantities encountered in various applied fields
of science which also feature the scale-invariance of scaling laws are described in Wiesenfeld
(2001) [61]. Nowadays, Pareto distributions play an important role in the quantitative modelling
of financial risk (see, e.g., Bouchaud and Potters (2003)).

Working out the equation of the Lorenz curve associated with a Pareto distribution according to
Eq. (7.29), using Eq. (8.73), yields a particularly simple result given by

\[ L(\alpha; \gamma) = 1 - (1 - \alpha)^{1 - (1/\gamma)} . \tag{8.77} \]



This result forms the basis of Pareto's famous 80/20 rule concerning concentration in the
distribution of an asset of general importance in a given population. According to Pareto's empirical
findings, typically 80% of such an asset are owned by just 20% of the regarded population
(and vice versa) (cf. Pareto (1896) [39]).³ The 80/20 rule applies exactly for a power-law index
$\gamma = \frac{\ln(5)}{\ln(4)} \approx 1.16$. It is a prominent example of the phenomenon of universality, frequently
observed in the mathematical modelling of quantitative-empirical relationships between variables in
a wide variety of scientific disciplines; cf. Gleick (1987) [19, p 157ff].
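That the power-law index $\gamma = \ln(5)/\ln(4)$ reproduces the 80/20 rule can be verified by evaluating the Lorenz curve of a Pareto distribution, $L(\alpha; \gamma) = 1 - (1 - \alpha)^{1 - (1/\gamma)}$, at $\alpha = 0.8$. The Python sketch below is a minimal illustration; the function name `pareto_lorenz` is an assumption of this example.

```python
import math


def pareto_lorenz(alpha: float, gamma: float) -> float:
    """Lorenz curve L(alpha; gamma) = 1 - (1 - alpha)^(1 - 1/gamma) of a Pareto distribution."""
    if gamma <= 1.0:
        raise ValueError("the Lorenz curve requires a finite expectation value, i.e. gamma > 1")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return 1.0 - (1.0 - alpha) ** (1.0 - 1.0 / gamma)


gamma_8020 = math.log(5.0) / math.log(4.0)  # approximately 1.16

# Share of the asset held by the poorest 80% of the population: 20%.
share_bottom_80 = pareto_lorenz(0.8, gamma_8020)
print(round(gamma_8020, 2), share_bottom_80)
```

Consequently, the remaining 80% of the asset are held by the richest 20% of the population, which is the statement of the 80/20 rule.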

In practice, bivariate quantitative-empirical data $\{(x_i, y_i)\}_{i=1,\ldots,n}$ for positive variables $(X, Y)$ is
tested for a Pareto distribution by investigating a scatter plot for the logarithmic quantities $\ln(y_i)$
against $\ln(x_i)$ for correlation; cf. Sec. 12.1. Given there is a functional relationship between $Y$ and
$X$ of the form $y = K x^{-(\gamma+1)}$, then the logarithmic quantities are related by

\[ \ln(y) = \ln(K) - (\gamma + 1) \times \ln(x) , \]

i.e., there exists a straight line relationship between $\ln(y)$ and $\ln(x)$ with negative slope equal to
$-(\gamma + 1)$.
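The straight line relationship in the log-log representation suggests estimating the exponent from the slope of a least squares line fitted to the logarithmic data. The following Python sketch illustrates this for noise-free synthetic data; the function name `loglog_fit` and the values $\gamma = 1.5$ and $K = 3$ are assumptions made for this example, and real data would scatter around the line.

```python
import math


def loglog_fit(xs, ys):
    """Least squares line through the points (ln x_i, ln y_i); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    s_xy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    s_xx = sum((u - mean_x) ** 2 for u in lx)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


gamma, K = 1.5, 3.0
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [K * x ** (-(gamma + 1.0)) for x in xs]  # exact power-law data, no noise

slope, intercept = loglog_fit(xs, ys)
gamma_estimate = -slope - 1.0   # since slope = -(gamma + 1)
K_estimate = math.exp(intercept)
print(slope, gamma_estimate, K_estimate)
```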



8.10 Power-law distribution 



While the pdf of a Pareto distribution, discussed in the previous section, is proportional to positive
powers of $1/x$, the slightly more general three-parameter power-law distribution,

\[ X \sim Pl(a; b; c) , \tag{8.78} \]

includes cases with a pdf proportional to positive powers of $x$ itself (i.e., for $c > 1$).

Spectrum of values:

\[ X \mapsto x \in \{x \,|\, a \leq x \leq a + b, \ b \in \mathbb{R}_{>0}\} \subset \mathbb{R} . \tag{8.79} \]

Probability density function (pdf):

\[ f_X(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{c}{b} \left(\dfrac{x - a}{b}\right)^{c-1} , \ c \in \mathbb{R}_{>0} & \text{for } a \leq x \leq a + b \\ 0 & \text{for } x > a + b \end{cases} . \tag{8.80} \]



3 See also footnote 2 in Subsec. 3.4.2.
Cumulative distribution function (cdf):

\[ F_X(x) = P(X \leq x) = \begin{cases} 0 & \text{for } x < a \\ \left(\dfrac{x - a}{b}\right)^{c} & \text{for } a \leq x \leq a + b \\ 1 & \text{for } x > a + b \end{cases} . \tag{8.81} \]

Expectation value and variance:

\[ \mathrm{E}(X) = a + \frac{c}{c + 1} \, b \tag{8.82} \]
\[ \mathrm{Var}(X) = \frac{c}{(c + 1)^2 (c + 2)} \, b^2 . \tag{8.83} \]

$\alpha$-quantiles:

\[ \alpha = F_X(x_\alpha) = \left(\frac{x_\alpha - a}{b}\right)^{c} \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = a + \sqrt[c]{\alpha} \, b \quad \text{for all } 0 < \alpha < 1 . \tag{8.84} \]



8.11 Special hyperbolic distribution 

The special hyperbolic distribution for a continuous random variable $X$,

\[ X \sim sHyp , \tag{8.85} \]

which does not depend on any free parameters, can be considered an exotic but quite simple
representative of a continuous probability distribution law.

Spectrum of values:

\[ X \mapsto x \in [0, 1] \subset \mathbb{R}_{\geq 0} . \tag{8.86} \]

Probability density function (pdf):

\[ f_X(x) = \begin{cases} \dfrac{1}{\ln(2)} \, \dfrac{1}{1 + x} & \text{for } x \in [0, 1] \\ 0 & \text{otherwise} \end{cases} ; \tag{8.87} \]

its graph is shown in Fig. 8.11 below.

Cumulative distribution function (cdf):

\[ F_X(x) = P(X \leq x) = \begin{cases} 0 & \text{for } x < 0 \\ \dfrac{1}{\ln(2)} \ln(1 + x) & \text{for } x \in [0, 1] \\ 1 & \text{for } x > 1 \end{cases} . \tag{8.88} \]
Figure 8.11: pdf of the special hyperbolic distribution according to Eq. (8.87).

Expectation value and variance:

\[ \mathrm{E}(X) = \frac{1 - \ln(2)}{\ln(2)} \tag{8.89} \]
\[ \mathrm{Var}(X) = \frac{\frac{3}{2} \ln(2) - 1}{(\ln(2))^2} . \tag{8.90} \]

$\alpha$-quantiles:

\[ \alpha = F_X(x_\alpha) = \frac{1}{\ln(2)} \ln(1 + x_\alpha) \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = e^{\alpha \ln(2)} - 1 \quad \text{for all } 0 < \alpha < 1 . \tag{8.91} \]
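The moments and quantiles of the special hyperbolic distribution can be cross-checked by elementary numerical integration of the pdf $f_X(x) = 1/[\ln(2)(1 + x)]$ on $[0, 1]$. The Python sketch below uses a simple midpoint rule; the function names and the number of integration steps are assumptions of this illustration.

```python
import math

LN2 = math.log(2.0)


def shyp_pdf(x: float) -> float:
    """pdf of the special hyperbolic distribution."""
    return 1.0 / (LN2 * (1.0 + x)) if 0.0 <= x <= 1.0 else 0.0


def shyp_cdf(x: float) -> float:
    """cdf of the special hyperbolic distribution."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return math.log(1.0 + x) / LN2


def shyp_quantile(alpha: float) -> float:
    """alpha-quantile x_alpha = exp(alpha * ln 2) - 1."""
    return math.exp(alpha * LN2) - 1.0


def raw_moment(power: int, steps: int = 100000) -> float:
    """Midpoint-rule approximation of the integral of x^power * f_X(x) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** power * shyp_pdf((i + 0.5) * h) for i in range(steps))


mean = raw_moment(1)
var = raw_moment(2) - mean ** 2
print(mean, (1.0 - LN2) / LN2)            # numerical vs. closed-form expectation value
print(var, (1.5 * LN2 - 1.0) / LN2 ** 2)  # numerical vs. closed-form variance
print(shyp_quantile(0.5))                 # median, equal to sqrt(2) - 1
```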



8.12 Cauchy distribution 



The French mathematician Augustin Louis Cauchy (1789-1857) is credited with introducing into
Statistics the continuous two-parameter distribution law

\[ X \sim Ca(b; a) \tag{8.92} \]

with properties

Spectrum of values:

\[ X \mapsto x \in D \subseteq \mathbb{R} . \tag{8.93} \]

Probability density function (pdf):

\[ f_X(x) = \frac{1}{\pi} \, \frac{a}{a^2 + (x - b)^2} , \quad \text{with } a \in \mathbb{R}_{>0} , \ b \in \mathbb{R} ; \tag{8.94} \]

its graph is shown in Fig. 8.12 below for two particular cases.

Figure 8.12: pdf of the Cauchy distribution according to Eq. (8.94). Displayed are the cases
$Ca(1; 1)$ (strongly peaked) and $Ca(-1; 3)$ (moderately peaked).



Cumulative distribution function (cdf):

\[ F_X(x) = P(X \leq x) = \frac{1}{2} + \frac{1}{\pi} \arctan\left(\frac{x - b}{a}\right) . \tag{8.95} \]

Expectation value and variance:⁴

\[ \mathrm{E}(X): \ \text{does NOT exist due to a diverging integral} \tag{8.96} \]
\[ \mathrm{Var}(X): \ \text{does NOT exist due to a diverging integral} . \tag{8.97} \]



4 In the case of a Cauchy distribution the fall-off in the tails of the pdf is not sufficiently fast for the expectation
value and variance integrals, Eqs. (7.26) and (7.27), to converge to finite values.
See, e.g., Sivia and Skilling (2006), p 34.
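The practical consequence of the non-existence of the expectation value can be made visible by simulation: sample means of Cauchy-distributed data do not settle down as the sample size grows, whereas the sample median remains a stable estimate of the location parameter $b$. The Python sketch below generates Cauchy variates by inverting the cdf of Eq. (8.95), which gives $x = b + a \tan[\pi(u - 1/2)]$ for $u$ uniformly distributed on (0, 1); the function names, parameter values and seed are assumptions of this illustration.

```python
import math
import random
from statistics import fmean, median


def cauchy_cdf(x: float, b: float, a: float) -> float:
    """cdf of Ca(b; a): 1/2 + arctan((x - b) / a) / pi."""
    return 0.5 + math.atan((x - b) / a) / math.pi


def cauchy_quantile(alpha: float, b: float, a: float) -> float:
    """Inverse of the cdf: b + a * tan(pi * (alpha - 1/2))."""
    return b + a * math.tan(math.pi * (alpha - 0.5))


def cauchy_sample(n: int, b: float, a: float, seed: int = 0) -> list:
    """Inverse-transform sampling: insert uniform random numbers into the quantile function."""
    rng = random.Random(seed)
    return [cauchy_quantile(rng.random(), b, a) for _ in range(n)]


b, a = 1.0, 1.0
data = cauchy_sample(20001, b, a)

# Sample means for growing sample sizes typically keep jumping around,
# while the sample median stays close to b.
for n in (100, 1000, 10000, 20001):
    print(n, fmean(data[:n]), median(data[:n]))
```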
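$\alpha$-quantiles:

\[ \alpha = F_X(x_\alpha) \quad \Leftrightarrow \quad x_\alpha = F_X^{-1}(\alpha) = b + a \tan\left[\pi\left(\alpha - \frac{1}{2}\right)\right] \quad \text{for all } 0 < \alpha < 1 . \tag{8.98} \]

8.13 Central limit theorem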



Consider a set of $n$ mutually stochastically independent, additive random variables $X_1, \ldots, X_n$
[cf. Eq. (6.13)], with (i) finite expectation values $\mu_1, \ldots, \mu_n$, (ii) finite variances $\sigma_1^2, \ldots, \sigma_n^2$, which
are not too different from one another, and (iii) corresponding cdfs $F_1(x), \ldots, F_n(x)$. Define for
this set a total sum $Y_n$ and a sample mean $\bar{X}_n$ according to Eq. (7.34). In analogy to Eq. (7.32),
a standardised summation random variable associated with $Y_n$ is given by

\[ Z_n := \frac{Y_n - \sum_{i=1}^{n} \mu_i}{\sqrt{\sum_{j=1}^{n} \sigma_j^2}} . \tag{8.99} \]

Subject to the convergence condition

\[ \lim_{n \to \infty} \max_{1 \leq i \leq n} \frac{\sigma_i}{\sqrt{\sum_{j=1}^{n} \sigma_j^2}} = 0 , \tag{8.100} \]

i.e., that asymptotically the standard deviation of the total sum dominates the standard
deviations of all individual $X_i$, and certain additional regularity requirements (cf. Rinne
(2008) [45, p 427f]), the central limit theorem in its general form according to the Finnish
mathematician Jarl Waldemar Lindeberg (1876-1932) and the Croatian-American mathematician
William Feller (1906-1970) states that in the asymptotic limit of infinitely many $X_i$,

\[ \lim_{n \to \infty} F_n(z_n) = \Phi(z) , \tag{8.101} \]

i.e., the limit distribution of the standardised summation random variable $Z_n$ is given by the
standard normal distribution $N(0; 1)$ introduced in Sec. 8.5; cf. Lindeberg (1922) [32] and
Feller (1951) [13]. Earlier results on the asymptotic distributional properties of a sum of
independent random variables were obtained by the Russian mathematician, mechanician and physicist
Aleksandr Mikhailovich Lyapunov (1857-1918); cf. Lyapunov (1901) [34]. Thus, under fairly
general conditions, the normal distribution acts as an attractor distribution for the sum of $n$
stochastically independent, additive random variables $X_i$.⁵ In oversimplified terms: this result
bears a certain economical convenience for most practical purposes in that, given favourable
conditions, when the size of a random sample is sufficiently large (in practice a typical rule of thumb
is $n \geq 50$), one essentially needs to know the characteristic features of only a single continuous
probability distribution law to perform, e.g., the statistical testing of hypotheses; cf. Ch. 11. As
will become apparent in subsequent chapters, the central limit theorem has profound ramifications
for applications in all empirical scientific disciplines.

5 Put differently, for increasingly large $n$ the cdf of the total sum $Y_n$ approximates a normal distribution with
expectation value $\sum_{i=1}^{n} \mu_i$ and variance $\sum_{i=1}^{n} \sigma_i^2$ to an increasingly accurate degree. In particular, all reproductive
distributions may be approximated by a normal distribution as $n$ becomes large.

Note that for finite $n$ the central limit theorem makes no statement as to the nature of the tails
of the distributions of $Z_n$ and of $Y_n$ (where, in principle, it can be very different from a normal
distribution; cf. Bouchaud and Potters (2003), p 25f).

A direct consequence of the central limit theorem and its preconditions is the fact that for the
sample mean $\bar{X}_n$ both

\[ \lim_{n \to \infty} \mathrm{E}(\bar{X}_n) = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} \mu_i}{n} \quad \text{and} \quad \lim_{n \to \infty} \mathrm{Var}(\bar{X}_n) = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} \sigma_i^2}{n^2} \]

converge to finite values. This property is most easily recognised in the special case of $n$
stochastically independent and identically distributed (in short: "i.i.d.") additive random variables
$X_1, \ldots, X_n$, with common finite expectation value $\mu$, finite variance $\sigma^2$, and cdf $F(x)$.⁶ Then,

\[ \lim_{n \to \infty} \mathrm{E}(\bar{X}_n) = \lim_{n \to \infty} \frac{n\mu}{n} = \mu \tag{8.102} \]
\[ \lim_{n \to \infty} \mathrm{Var}(\bar{X}_n) = \lim_{n \to \infty} \frac{n\sigma^2}{n^2} = \lim_{n \to \infty} \frac{\sigma^2}{n} = 0 , \tag{8.103} \]

which is known as the law of large numbers according to the Swiss mathematician
Jacob Bernoulli (1654-1705); the sample mean $\bar{X}_n$ converges stochastically to its expectation
value $\mu$.
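Both the central limit theorem and the law of large numbers can be illustrated by a small simulation with i.i.d. random variables that are clearly not normally distributed. The Python sketch below uses the continuous uniform distribution on [0, 1], for which $\mu = 1/2$ and $\sigma^2 = 1/12$; the function name `standardised_sums`, the choices $n = 50$ and 10000 repetitions, and the seed are assumptions of this example.

```python
import math
import random
from statistics import NormalDist, fmean, variance

MU = 0.5               # expectation value of the uniform distribution on [0, 1]
SIGMA2 = 1.0 / 12.0    # variance of the uniform distribution on [0, 1]


def standardised_sums(n: int, repetitions: int, seed: int = 0) -> list:
    """Realisations of Z_n = (Y_n - n*mu) / sqrt(n*sigma^2) for sums Y_n of n uniform variables."""
    rng = random.Random(seed)
    zs = []
    for _ in range(repetitions):
        y_n = sum(rng.random() for _ in range(n))
        zs.append((y_n - n * MU) / math.sqrt(n * SIGMA2))
    return zs


n = 50
zs = standardised_sums(n=n, repetitions=10000)

# Central limit theorem: the empirical cdf of Z_n at z = 1 is close to Phi(1).
share_below_one = sum(z <= 1.0 for z in zs) / len(zs)
print(share_below_one, NormalDist().cdf(1.0))

# Law of large numbers: the sample means scatter around mu with variance sigma^2 / n.
sample_means = [MU + z * math.sqrt(SIGMA2 / n) for z in zs]
print(fmean(sample_means), variance(sample_means), SIGMA2 / n)
```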

This ends Part II of these lecture notes and we now turn to Part III in which we focus on a number 
of useful applications of inferential statistical methods of data analysis. 



6 These conditions lead to the central limit theorem in the special form according to Jarl Waldemar Lindeberg
(1876-1932) and the French mathematician Paul Pierre Lévy (1886-1971).
Chapter 9



Operationalisation of latent variables: 
Likert's scaling method of summated item 
ratings 

The most frequently practiced method to date of operationalising latent variables (such as "social
constructs") in the Social Sciences and Humanities is due to the US-American psychologist
Rensis Likert (1903-1981). In his 1932 paper, which completed his thesis work for a Ph.D.,
he expressed the idea that latent variables $X_L$, when they are perceived as one-dimensional
in nature, can be rendered measurable in a quasi-metrical fashion by means of the summated
ratings over an extended list of suitable indicator items $X_i$ ($i = 1, 2, \ldots$). Such indicator
items are often formulated as specific statements relating to the theoretical concepts underlying
a particular 1-D latent variable $X_L$ to be measured, with respect to which test persons need
to express their level of agreement. A typical response scale for the items $X_i$, providing the
necessary item ratings, is given for instance by the 5-level ordinally ranked attributes of agreement

1: strongly disagree/strongly unfavourable

2: disagree/unfavourable

3: undecided

4: agree/favourable

5: strongly agree/strongly favourable.

In the research literature, one also encounters 7-level or 10-level item rating scales. Note that it is
assumed that the items $X_i$, and thus their ratings, can be treated as additive, so that the conceptual
principles of Sec. 7.5 relating to sums of random variables can be relied upon. When forming the
sum over the ratings of all indicator items selected, it is essential to pay careful attention to the
polarity of the items involved. For the resultant total sum $\sum_i X_i$ to be consistent, the polarity of
all items used needs to be uniform.¹

1 For a questionnaire, however, it is strongly recommended to include also indicator items of reversed polarity. This
will improve the overall construct validity of the measurement tool.
The construction of a consistent Likert scale for a 1-D latent variable $X_L$ involves four basic
steps (see, e.g., Trochim (2006) [58]):

(i) the compilation of an initial list of 80 to 100 potential indicator items $X_i$ for the latent
variable of interest,

(ii) the draw of a gauge random sample from the targeted population $\Omega$,

(iii) the computation of the total sum $\sum_i X_i$ of item ratings, and, most importantly,

(iv) the performance of an item analysis based on the sample data and the associated total sum
$\sum_i X_i$ of item ratings.

The item analysis, in particular, consists of the consistent application of two exclusion criteria.
Items are discarded from the list when either

(a) they show a weak item-to-total correlation with the total sum $\sum_i X_i$ (a rule of thumb is to
exclude items with correlations less than 0.5), or

(b) it is possible to increase the value of Cronbach's² $\alpha$-coefficient (see Cronbach (1951) [9]),
a measure of the scale's internal consistency reliability, by excluding a particular item
from the list (the objective being to attain $\alpha$-values greater than 0.8).

For a set of $m \in \mathbb{N}$ indicator items $X_i$, Cronbach's $\alpha$-coefficient is defined by

\[ \alpha := \left(\frac{m}{m - 1}\right) \left(1 - \frac{\sum_{i=1}^{m} S_i^2}{S_{\text{total}}^2}\right) , \tag{9.1} \]

where $S_i^2$ denotes the sample variance associated with the $i$th indicator item, and $S_{\text{total}}^2$ is the
sample variance of the total sum $\sum_i X_i$.

SPSS: Analyze → Scale → Reliability Analysis . . . (Model: Alpha) → Statistics . . . : Scale if item
deleted

R: alpha(items) (package: psych)
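To make the definition in Eq. (9.1) concrete, the following Python sketch computes Cronbach's $\alpha$-coefficient directly from item ratings. The function name `cronbach_alpha`, the data layout (one list of ratings per item) and the small data set of three items rated by five test persons are hypothetical and serve illustration only.

```python
from statistics import variance


def cronbach_alpha(item_ratings) -> float:
    """Cronbach's alpha for m items; item_ratings[i] holds the ratings of item i for all persons."""
    m = len(item_ratings)
    if m < 2:
        raise ValueError("at least two items are required")
    # Total sum of item ratings per test person.
    totals = [sum(person) for person in zip(*item_ratings)]
    sum_item_variances = sum(variance(item) for item in item_ratings)  # sum of S_i^2
    total_variance = variance(totals)                                  # S_total^2
    return (m / (m - 1)) * (1.0 - sum_item_variances / total_variance)


# Hypothetical 5-level ratings of three items by five test persons.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 4],
    [1, 3, 3, 5, 5],
]
alpha = cronbach_alpha(items)
print(round(alpha, 4))
```

For exclusion criterion (b), one would recompute the coefficient with each item left out in turn and discard an item if its removal increases the value of $\alpha$.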

The outcome of the item analysis is a drastic reduction of the initial list to a set of $k \in \mathbb{N}$ items $X_i$
($i = 1, \ldots, k$) of high discriminatory power (where typically $k$ is an integer in the range of 10 to
15). The associated total sum

\[ X_L := \sum_{i=1}^{k} X_i \tag{9.2} \]

thus operationalises the 1-D latent variable $X_L$ in a quasi-metrical fashion since it is to be measured
on an interval scale with a discretised spectrum of values given by

\[ X_L \mapsto \sum_{i=1}^{k} x_i \in [1k, 5k] . \tag{9.3} \]

1-D latent variable $X_L$:

• Item $X_1$:   strongly disagree   □ □ □ □ □   strongly agree

• Item $X_2$:   strongly disagree   □ □ □ □ □   strongly agree

  ⋮

• Item $X_k$:   strongly disagree   □ □ □ □ □   strongly agree

Table 9.1: Structure of a k-indicator-item Likert scale for some 1-D latent variable $X_L$, based on
a visualised equidistant 5-level item rating scale.

The structure of a finalised k-indicator-item Likert scale for some 1-D latent variable $X_L$ with an
equidistant graphical 5-level item rating scale is displayed in Tab. 9.1.

Likert's scaling method of aggregating information from a set of $k$ ordinally scaled items to form
an effectively quasi-metrical total sum $X_L = \sum_i X_i$ draws its legitimisation to a large extent from
the central limit theorem (cf. Sec. 8.13). In practice it is found that in many cases of interest the
total sum $X_L = \sum_i X_i$ is normally distributed in its samples to a very good approximation. The
main shortcoming of Likert's approach is the dependency of the gauging process of the scale on the
targeted population.

2 Named after the US-American educational psychologist Lee Joseph Cronbach (1916-2001). The range of the
normalised $\alpha$-coefficient is the real-valued interval [0,1].
Chapter 10



Random sampling of populations and 
statistical testing of hypotheses 

Quantitative-empirical research methods may be employed for exploratory as well as for
confirmatory data analysis. Here we will focus on the latter. To investigate research questions
systematically by statistical means, with the objective of making inferences about the distributional
properties of a set of statistical variables in an entire population $\Omega$ of statistical units on the
basis of an analysis of data from just a few units in a sample, the following three issues have to
be addressed in a clearcut fashion:

(i) the target population $\Omega$ of the research activity needs to be defined in an unambiguous way,

(ii) an adequate random sample $S_\Omega$ needs to be drawn from $\Omega$, and

(iii) a reliable mathematical procedure for estimating quantitative population parameters
from random sample data needs to be employed.

We will briefly discuss these issues in turn, beginning with a review in Tab. 10.1 of conventional
notation for distinguishing specific statistical measures relating to populations from the
corresponding ones relating to random samples. Towards the end of this chapter, we will highlight the
principles underlying a systematic statistical testing of hypotheses regarding specific
distributional features of observable quantities.

Random variables in a population $\Omega$ (of size $N$) will be denoted by capital Latin letters such
as $X, Y, \ldots, Z$, while their realisations in random samples (of size $n$) will be denoted by
lower case Latin letters such as $x_i, y_i, \ldots, z_i$ ($i = 1, \ldots, n$). In addition, one denotes population
parameters by lower case Greek letters, while for their corresponding point estimator functions
relating to random samples, which are also perceived as random variables, again capital Latin
letters are used for representation. The ratio $n/N$ will be referred to as the sampling fraction. As
is standard in the literature, we will denote a particular random sample of size $n$ for a random
variable $X$ by a set $S_\Omega$: $(X_1, \ldots, X_n)$.

In actual practice, it is often not possible to acquire access for the purpose of enquiry to every
single statistical unit belonging to an identified target population $\Omega$, not even in principle. For
example, this could be due to the fact that $\Omega$'s size $N$ is far too large to be determined accurately.
Population $\Omega$                              Random sample $S_\Omega$
population size $N$                              sample size $n$
arithmetical mean $\mu$                          sample mean $\bar{X}_n$
standard deviation $\sigma$                      sample standard deviation $S_n$
median $\tilde{x}_{0.5}$                         sample median $\tilde{X}_{0.5,n}$
correlation coefficient $\rho$                   sample correlation coefficient $r$
rank correlation coefficient $\rho_S$            sample rank correl. coefficient $r_S$
regression coefficient (intercept) $\alpha$      sample regression intercept $a$
regression coefficient (slope) $\beta$           sample regression slope $b$

Table 10.1: Notation for distinguishing between statistical measures relating to a population $\Omega$
on the one hand, and to the corresponding quantities and unbiased point estimator functions
obtained from a random sample $S_\Omega$ on the other.

In this case, to ensure a thorough investigation, one needs to resort to using a sampling frame
representative of $\Omega$. By this one understands a representative list of elements in $\Omega$ to which access
may actually be obtained. Such a list will have to be compiled by some authority of scientific
integrity. In an attempt to avoid a notational overflow, we will subsequently continue to use $N$ to
denote both: the size of $\Omega$ and the size of its associated sampling frame (even though this is not
entirely correct).

We now proceed to introduce the three most commonly practiced methods of drawing random
samples from given fixed populations $\Omega$ of statistical units.



10.1 Random sampling methods 
10.1.1 Simple random sampling 

The simple random sampling technique can be best understood in terms of the urn model of
combinatorics introduced in Sec. 6.4. Given a population $\Omega$ of $N$ distinguishable statistical units,
there is a total of $\binom{N}{n}$ distinct possibilities of drawing samples of size $n$ from $\Omega$, given the
order of selection is not being accounted for. A simple random sample is then defined by the
property that its probability of selection is equal to

\[ \frac{1}{\binom{N}{n}} , \tag{10.1} \]

according to the Laplacian principle of Eq. (6.8). This has the immediate consequence that the a
priori probability of selection of any single statistical unit is given by¹

\[ 1 - \frac{\binom{N-1}{n}}{\binom{N}{n}} = 1 - \frac{N - n}{N} = \frac{n}{N} . \tag{10.2} \]

On the other hand, the probability that two statistical units $i$ and $j$ are being selected for the same
sample amounts to

\[ \frac{n}{N} \times \frac{n - 1}{N - 1} . \tag{10.3} \]

As such, by Eq. (6.13), the selection of two statistical units is not stochastically independent (in
which case the joint probability would be $n/N \times n/N$). However, for sampling fractions $n/N \leq
0.05$, stochastic independence of the selection of statistical units generally holds to a reasonably
good approximation. When, in addition, $n \geq 50$, likewise the conditions for the central limit
theorem in the variant of Lindeberg and Lévy to apply (cf. Sec. 8.13) hold to a rather good degree.
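The selection probabilities of simple random sampling can be verified with exact rational arithmetic. In the Python sketch below, the function names and the example values $N = 20$ and $n = 5$ are assumptions made for illustration; `math.comb`, `fractions.Fraction` and `random.Random.sample` belong to Python's standard library.

```python
import random
from fractions import Fraction
from math import comb


def inclusion_probability(N: int, n: int) -> Fraction:
    """Probability that a given unit is part of a simple random sample: 1 - C(N-1, n) / C(N, n)."""
    return 1 - Fraction(comb(N - 1, n), comb(N, n))


def pair_inclusion_probability(N: int, n: int) -> Fraction:
    """Probability that two given units are part of the same sample: C(N-2, n-2) / C(N, n)."""
    return Fraction(comb(N - 2, n - 2), comb(N, n))


N, n = 20, 5
p_single = inclusion_probability(N, n)      # equals n/N
p_pair = pair_inclusion_probability(N, n)   # equals n/N * (n-1)/(N-1)
print(p_single, p_pair, p_single ** 2)      # p_pair differs from p_single^2: no independence

# Drawing one simple random sample of unit labels 0, ..., N-1 without replacement.
rng = random.Random(0)
sample = rng.sample(range(N), n)
print(sorted(sample))
```

The comparison of `p_pair` with `p_single ** 2` quantifies the departure from stochastic independence; the difference becomes negligible for small sampling fractions $n/N$.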



1 In the literature this particular property of a random sample is referred to as "epsem": equal probability of selection 
method. 
10.1.2 Stratified random sampling

Stratified random sampling adapts the sampling process to an intrinsic structure of the population
$\Omega$ as provided by the $k$ mutually exclusive and exhaustive categories of some qualitative (nominal
or ordinal) variable, which thus defines a set of $k$ strata (layers) of $\Omega$. By construction, there are
$N_i$ statistical units belonging to the $i$th stratum ($i = 1, \ldots, k$). Simple random samples of sizes
$n_i$ are drawn from each stratum according to the principles outlined in Subsec. 10.1.1, yielding a
total sample of size $n = n_1 + \ldots + n_k$. Frequently applied variants of this sampling technique are
(i) proportionate allocation of statistical units, defined by the condition²

\[ \frac{n_i}{n} = \frac{N_i}{N} \quad \Rightarrow \quad \frac{n_i}{N_i} = \frac{n}{N} ; \tag{10.4} \]

in particular, this allows for a fair representation of minorities in $\Omega$, and (ii) optimal allocation
of statistical units which aims at a minimisation of the resultant sample variances of the variables
investigated. Further details on the stratified random sampling technique can be found, e.g., in
Bortz and Döring (2006) [5, p 425ff].
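Proportionate allocation according to Eq. (10.4) generally produces non-integer stratum sample sizes, which have to be rounded in practice. The Python sketch below handles this with the largest-remainder method; this rounding rule, the function name `proportionate_allocation` and the example stratum sizes are assumptions of this illustration and are not prescribed by the notes.

```python
def proportionate_allocation(stratum_sizes, n: int) -> list:
    """Split a total sample size n over strata such that n_i / n is as close as possible to N_i / N."""
    N = sum(stratum_sizes)
    if not 0 < n <= N:
        raise ValueError("n must satisfy 0 < n <= N")
    # Integer part of the exact quota n * N_i / N for every stratum ...
    allocation = [n * N_i // N for N_i in stratum_sizes]
    # ... and the remainders that decide who receives the units still left over.
    remainders = [n * N_i % N for N_i in stratum_sizes]
    leftover = n - sum(allocation)
    by_remainder = sorted(range(len(stratum_sizes)), key=lambda i: remainders[i], reverse=True)
    for i in by_remainder[:leftover]:
        allocation[i] += 1
    return allocation


print(proportionate_allocation([500, 300, 200], 50))   # exact quotas: 25, 15, 10
print(proportionate_allocation([333, 333, 334], 10))   # quotas need rounding
```

Simple random samples of the resulting sizes $n_i$ would then be drawn separately within each stratum.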



10.1.3 Cluster random sampling 

When the population $\Omega$ naturally subdivides into an exhaustive set of $K$ mutually exclusive
clusters of statistical units, a convenient sampling strategy is given by selecting $k < K$ clusters from
the set at random and performing complete surveys within each of the chosen clusters. The probability
of selection of any particular statistical unit from $\Omega$ thus amounts to $k/K$. This cluster random
sampling method has the practical advantage of being less contrived. However, in general it entails
sampling errors that are greater than in the previous two sampling approaches. Further details on
the cluster random sampling technique can be found, e.g., in Bortz and Döring (2006) [5, p 435ff].



10.2 Point estimator functions 

Many inferential statistical methods of data analysis revolve around the estimation of unknown
distribution parameters $\theta$ with respect to some population $\Omega$ by means of corresponding point
estimator functions $\hat{\theta}_n(X_1, \ldots, X_n)$ (or statistics), the values of which are computed from the
data of a random sample $S_\Omega$: $(X_1, \ldots, X_n)$. Owing to the stochastic nature of the random
sampling approach, any point estimator function $\hat{\theta}_n(X_1, \ldots, X_n)$ is subject to a random sampling
error. One can show that this estimation procedure becomes reliable provided that a point
estimator function satisfies the following two important criteria of quality:

(i) Unbiasedness: $\mathrm{E}(\hat{\theta}_n) = \theta$

(ii) Consistency: $\lim_{n \to \infty} \mathrm{Var}(\hat{\theta}_n) = 0$.



2 Note that, thus, this also has the "epsem" property. 
For metrically scaled random variables $X$, defining for a given random sample $S_\Omega$: $(X_1, \ldots, X_n)$ 
of size $n$ a sample total by 

\[
Y_n := \sum_{i=1}^{n} X_i , \tag{10.5}
\]

the two most prominent point estimator functions satisfying the unbiasedness and consistency 
conditions are the 

\[
\text{sample mean:} \quad \bar{X}_n := \frac{1}{n}\, Y_n \tag{10.6}
\]

\[
\text{sample variance:} \quad S_n^2 := \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \bar{X}_n\right)^2 . \tag{10.7}
\]



These will be frequently employed in subsequent considerations for point-estimating the 
population parameters $\mu$ and $\sigma^2$. Sampling theory holds that the sizes of the standard errors 
associated with the point estimator functions (10.6) and (10.7) amount to the standard deviations of 
the underlying theoretical sampling distributions of these functions. For a given population $\Omega$ of 
size $N$, imagine drawing a very long sequence of mutually independent random samples of a fixed 
size $n$, from each of which individual realisations of $\bar{X}_n$ and $S_n^2$ are obtained. In the limit that this 
sequence becomes infinitely long, and $n$ is kept fixed, the theoretical sampling distributions of 
$\bar{X}_n$ and $S_n^2$ are normal (cf. Sec. 8.5) resp. $\chi^2$ with $n - 1$ degrees of freedom (cf. Sec. 8.6), with 
standard deviations 

\[
\frac{S_n}{\sqrt{n}} \quad \text{resp.} \quad \sqrt{\frac{2}{n-1}}\, S_n^2 ; \tag{10.8}
\]

cf., e.g., Lehmann and Casella (1998) [28, p 91ff], and Levin et al (2010) [Ch. 6]. Thus, for a 
finite sample standard deviation $S_n$, these two standard errors decrease with the sample size $n$ 
proportional to the inverse of $\sqrt{n}$ resp. the inverse of $\sqrt{n-1}$. 
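The point estimates of Eqs. (10.5) to (10.7) and the standard errors of Eq. (10.8) can be evaluated directly from sample data. The following Python sketch does so for a small, purely hypothetical data set; the variable names are illustrative.

```python
import math

# Hypothetical realisations x_1, ..., x_n of a metrically scaled variable X.
data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]
n = len(data)

sample_total = sum(data)                                   # Eq. (10.5)
sample_mean = sample_total / n                             # Eq. (10.6)
sample_variance = sum((x - sample_mean) ** 2 for x in data) / (n - 1)   # Eq. (10.7)

# Standard errors of the sample mean and of the sample variance, Eq. (10.8).
se_mean = math.sqrt(sample_variance) / math.sqrt(n)
se_variance = math.sqrt(2 / (n - 1)) * sample_variance

print(sample_mean, sample_variance, se_mean, se_variance)
```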



10.3 Statistical tests of hypotheses 
10.3.1 General procedure 

The statistical testing of hypotheses by means of observable quantities is the centrepiece 
of the current body of inferential statistical methods. Its logic of an ongoing routine 
of a systematic falsification of hypotheses by empirical means is firmly rooted in the 
ideas of critical rationalism and logical positivism, as expressed most emphatically by the 
Austro-British philosopher Sir Karl Raimund Popper CH FRS FBA (1902-1994); see, e.g., 
Popper (2002) [44]. The systematic procedure for statistically testing hypotheses, as practised 
today as a standardised method of probability-based decision-making, was developed 
during the first half of the 20th Century, predominantly by the English statistician, 
evolutionary biologist, eugenicist and geneticist Sir Ronald Aylmer Fisher FRS (1890-1962), the 
Polish-US-American mathematician and statistician Jerzy Neyman (1894-1981), the English 
mathematician and statistician Karl Pearson FRS (1857-1936), and his son, the English statistician 
Egon Sharpe Pearson CBE FRS (1895-1980); cf. Fisher (1935) [16], Neyman and Pearson 
(1933) [38], and Pearson (1900) [40]. We will describe the main steps of the systematic test 
procedure in the following. 

The central aim of the statistical testing of hypotheses is to separate true effects in a targeted 
population $\Omega$ of statistical units concerning distributional properties of, or relations between, selected 
statistical variables $X, Y, \ldots, Z$ from chance effects due to the sampling approach to probing the 
nature of $\Omega$. The latter results in a generally unavoidable state of incomplete information on the 
part of the researcher. 

In an inferential statistical context, hypotheses are formulated as assumptions on 

(i) the probability distribution function $F$ of one or more random variables $X, Y, \ldots, Z$ in 
$\Omega$, or 

(ii) one or more parameters $\theta$ of this distribution function. 

Generically, statistical hypotheses need to be viewed as probabilistic statements. As such the 
researcher will always have to deal with a fair amount of uncertainty in deciding whether a 
particular effect is significant in $\Omega$ or not. Bernstein (1998) [2, p 207] summarises the circumstances 
relating to the test of a specific hypothesis as follows: 

"Under conditions of uncertainty, the choice is not between rejecting a hypothesis 
and accepting it, but between reject and not-reject." 

The question arises as to which kinds of quantitative problems can be efficiently settled by 
statistical means. With respect to a given target population $\Omega$, in the simplest kinds of applications 
of hypothesis tests, one may (a) test for differences in the distributional properties of a single 
statistical variable $X$ between a number of subgroups of $\Omega$, necessitating univariate methods of 
data analysis, or one may (b) test for association between two statistical variables $X$ and $Y$, thus 
requiring bivariate methods of data analysis. The standardised test procedure takes the following 
steps on the way to making a decision: 

Schematic of the statistical test of a hypothesis 

1. Formulation, with respect to $\Omega$, of a pair of mutually exclusive hypotheses: 

(a) the null hypothesis $H_0$ conjectures that "there exists no effect in $\Omega$ of the kind 
envisaged," while 

(b) the research hypothesis $H_1$ conjectures that "there does exist a true effect in $\Omega$ of the 
kind envisaged." 

The starting point of the procedure is the assumption (!) that it is the $H_0$ conjecture which 
holds true in $\Omega$. The objective is to try to refute $H_0$ empirically on the basis of random 
sample data drawn from $\Omega$, to a certain level of significance which needs to be specified in 
advance. In this sense it is $H_0$ which is being subjected to a statistical test.³ The striking 
asymmetry regarding the roles of $H_0$ and $H_1$ in the test procedure forms the basis of the 
method of falsification of hypotheses advocated by critical rationalism. 

3 Bernstein (1998) [2, p 209] refers to the statistical test of a hypothesis as a "mathematical stress test". 
2. Fixing of a significance level $\alpha$ prior to the test, where, by convention, $\alpha \in [0.01, 0.05]$. 
The parameter $\alpha$ is synonymous with the probability of committing a Type I error (to be 
defined below) in making a test decision. 

3. Construction of a suitable continuous real-valued measure, a test statistic $T_n(X_1, \ldots, X_n)$, 
for quantifying deviations of the data from a random sample $S_\Omega$: $(X_1, \ldots, X_n)$ of size $n$ 
from the initial "no effect in $\Omega$" conjecture of $H_0$, with known (!) associated theoretical 
distribution for computing corresponding event probabilities. 

4. Determination of the rejection region $B_\alpha$ for $H_0$ within the spectrum of values of the test 
statistic $T_n(X_1, \ldots, X_n)$ from re-arranging the condition 

\[
P\left(T_n(X_1, \ldots, X_n) \in B_\alpha \,|\, H_0\right) \leq \alpha . \tag{10.9}
\]

5. Computation of a specific realisation $t_n(x_1, \ldots, x_n)$ of the test statistic $T_n(X_1, \ldots, X_n)$ 
from data $x_1, \ldots, x_n$ of a random sample $S_\Omega$: $(X_1, \ldots, X_n)$. 

6. Obtaining a test decision on the basis of one of the alternative criteria: when for the 
realisation $t_n(x_1, \ldots, x_n)$ of the test statistic $T_n(X_1, \ldots, X_n)$, resp. the corresponding 
p-value (to be defined in Subsec. 10.3.2 below)⁴ 

(i) $t_n \in B_\alpha$, resp. p-value $< \alpha$ $\Rightarrow$ reject $H_0$, 

(ii) $t_n \notin B_\alpha$, resp. p-value $\geq \alpha$ $\Rightarrow$ not reject $H_0$. 

When performing a statistical test of a null hypothesis $H_0$, the researcher is at risk of making a 
wrong decision. Hereby, one distinguishes the following two kinds of error: 

Type I error: reject an $H_0$ which, however, is true, with probability $P(H_1 | H_0 \text{ true}) = \alpha$, and 
Type II error: not reject an $H_0$ which, however, is false, with probability $P(H_0 | H_1 \text{ true}) = \beta$. 

By fixing the significance level $\alpha$ prior to running a statistical test, one simultaneously controls the 
risk of a Type I error. We condense the different possible outcomes when making a test decision 
in the following table: 



4 The statistical software packages SPSS and R provide p-values as a means for making decisions in the statistical 
testing of hypotheses. 






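Consequences of test decisions: 

Reality / $\Omega$ | Decision for $H_0$ | Decision for $H_1$ 
--- | --- | --- 
$H_0$ true | correct decision: $P(H_0 | H_0 \text{ true}) = 1 - \alpha$ | Type I error: $P(H_1 | H_0 \text{ true}) = \alpha$ 
$H_1$ true | Type II error: $P(H_0 | H_1 \text{ true}) = \beta$ | correct decision: $P(H_1 | H_1 \text{ true}) = 1 - \beta$ 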

While the probability $\alpha$ is required to be specified prior to a statistical test, the probability $\beta$ can 
be computed only a posteriori. One refers to the probability $1 - \beta$ associated with the latter as the 
power of a particular statistical test. 
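That the significance level $\alpha$ controls the risk of a Type I error can be made plausible by a small Monte Carlo experiment. The Python sketch below is an illustration under stated assumptions only: samples of size $n = 50$ are drawn from a standard normal population in which $H_0$: $\mu = 0$ holds true, a two-sided Z-type test is applied to each, and the fraction of (wrong) rejections is recorded. The function name, the number of runs and the seed are arbitrary choices.

```python
import math
import random
from statistics import NormalDist


def type_one_error_rate(alpha=0.05, n=50, runs=2000, seed=0):
    """Fraction of rejections of a true H0: mu = 0 in repeated random sampling."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    rejections = 0
    for _ in range(runs):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t_n = mean / math.sqrt(var / n)            # standardised sample mean
        if abs(t_n) > z_crit:
            rejections += 1
    return rejections / runs


rate = type_one_error_rate()
print(rate)
```

The observed rejection rate should lie in the vicinity of $\alpha = 0.05$; small deviations are to be expected, since the number of runs is finite and the normal approximation is used for a sample of moderate size.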

Note that in the complementary Bayesian approach to statistical data analysis (cf. Subsec. 6.5.2) 
the empirical testing of hypotheses follows the logic 

\[
P(\text{hypothesis} | \text{data}) \propto P(\text{data} | \text{hypothesis}) \times P(\text{hypothesis}) , \tag{10.10}
\]

thus requiring information on (i) the joint probability distributions of parameters and random 
variables (with parameters treated as random variables) and the distribution of random sample data 
underlying the probability $P(\text{data} | \text{hypothesis})$, as well as specification of (ii) a subjective prior 
probability $P(\text{hypothesis})$. This information is not necessarily always available though. Further 
details on the Bayesian method are given in, e.g., Sivia and Skilling (2006) [47, p 6], or Lupton 
(1993) [p 50ff]. 
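For a finite set of competing hypotheses, the proportionality (10.10) can be turned into proper posterior probabilities by normalisation. The following Python sketch illustrates this; the prior probabilities and likelihood values are invented for the purpose of the example, and the function name is hypothetical.

```python
def posterior(priors, likelihoods):
    """Normalise P(data | hypothesis) * P(hypothesis) over all hypotheses, cf. Eq. (10.10)."""
    unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


# Hypothetical subjective prior probabilities and likelihoods of the observed data.
priors = {"H0": 0.5, "H1": 0.5}
likelihoods = {"H0": 0.02, "H1": 0.08}

post = posterior(priors, likelihoods)
print(post)
```

With equal priors, the posterior probabilities simply reflect the ratio of the two likelihoods.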

10.3.2 Definition of a p-value 

Def.: Let $T_n(X_1, \ldots, X_n)$ be the test statistic underlying a particular statistical test, the 
theoretical distribution of which be known under the assumption that the null hypothesis $H_0$ holds true 
in $\Omega$. If $t_n(x_1, \ldots, x_n)$ is the realisation of $T_n(X_1, \ldots, X_n)$ computed from the data $x_1, \ldots, x_n$ 
of a random sample $S_\Omega$: $(X_1, \ldots, X_n)$, then the p-value associated with $t_n(x_1, \ldots, x_n)$ is defined 
as the probability of obtaining a value of $T_n(X_1, \ldots, X_n)$ which is more extreme than the given 
$t_n(x_1, \ldots, x_n)$, given that the null hypothesis applies. 

Specifically, using the computational rules (7.22) - (7.24), one obtains for a 

• two-sided statistical test, 

\[
\begin{aligned}
p &:= P(T_n < -|t_n| \,|\, H_0) + P(T_n > |t_n| \,|\, H_0) \\
  &= P(T_n < -|t_n| \,|\, H_0) + 1 - P(T_n \leq |t_n| \,|\, H_0) \\
  &= F_{T_n}(-|t_n|) + 1 - F_{T_n}(|t_n|) .
\end{aligned} \tag{10.11}
\]


This result specialises to $p = 2\left[1 - F_{T_n}(|t_n|)\right]$ if the respective pdf exhibits reflection 
symmetry with respect to a vertical axis at $t_n = 0$, i.e., when $F_{T_n}(-|t_n|) = 1 - F_{T_n}(|t_n|)$ 
holds. 

• left-sided statistical test, 

\[
p := P(T_n < t_n \,|\, H_0) = F_{T_n}(t_n) , \tag{10.12}
\]

• right-sided statistical test, 

\[
p := P(T_n > t_n \,|\, H_0) = 1 - P(T_n \leq t_n \,|\, H_0) = 1 - F_{T_n}(t_n) . \tag{10.13}
\]

With respect to the test decision criterion of rejecting an $H_0$ whenever $p < \alpha$, one refers to 
(i) cases with $p < 0.05$ as significant test results, and to (ii) cases with $p < 0.01$ as highly 
significant test results. 
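Equations (10.11) to (10.13) translate directly into code once the cdf of the test statistic under $H_0$ is available. The following Python sketch assumes, for illustration only, a test statistic that is standard normally distributed under $H_0$; the function name and the `kind` labels are hypothetical.

```python
from statistics import NormalDist


def p_value(t_n, kind, cdf=NormalDist().cdf):
    """p-value of a realisation t_n, given the cdf of the test statistic under H0."""
    if kind == "two-sided":        # Eq. (10.11)
        return cdf(-abs(t_n)) + 1.0 - cdf(abs(t_n))
    if kind == "left-sided":       # Eq. (10.12)
        return cdf(t_n)
    if kind == "right-sided":      # Eq. (10.13)
        return 1.0 - cdf(t_n)
    raise ValueError("kind must be 'two-sided', 'left-sided' or 'right-sided'")


alpha = 0.05
p = p_value(2.3, "two-sided")
decision = "reject H0" if p < alpha else "not reject H0"
print(p, decision)
```

Any other cdf (e.g. that of a t- or a $\chi^2$-distribution) may be passed via the `cdf` argument instead.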

Remark: User-friendly routines for the computation of p-values are available in SPSS, R and 
EXCEL, and also on some GDCs. 

In the following two chapters, we will turn to discuss a number of standard problems in Inferential 
Statistics, in association with the quantitative-empirical tools that have been developed to tackle 
them. In Ch. 11 we will be concerned with problems of a univariate nature, in particular, testing 
for differences in the distributional properties of a single random variable $X$ between two or more 
subgroups of some population $\Omega$, while in Ch. 12 the problems at hand will be of a bivariate 
nature, testing for statistical association in $\Omega$ between pairs of random variables $(X, Y)$. 
Chapter 11 



Univariate methods of statistical data analysis: confidence intervals and testing for differences 



In this chapter we present a selection of standard inferential statistical techniques that, based on 
random sampling of some population $\Omega$, were developed for the purpose of (a) estimating unknown 
distribution parameters by means of confidence intervals, (b) testing for differences between a 
given empirical distribution of a random variable and its a priori assumed theoretical distribution, 
and (c) comparing distributional properties and parameters of a random variable between two or 
more subgroups of $\Omega$. Since the methods to be introduced relate to considerations on distributions 
of a single random variable only, they are referred to as univariate. 



11.1 Confidence intervals 

Assume given a continuous random variable $X$ which satisfies in some population $\Omega$ a Gaußian 
normal distribution with unknown distribution parameters $\mu$ and $\sigma^2$ (cf. Sec. 8.5). The issue is to 
determine, using data from a random sample $S_\Omega$: $(X_1, \ldots, X_n)$, a two-sided confidence interval 
estimate for any one of these unknown distribution parameters $\theta$ which can be relied on at a 
confidence level $1 - \alpha$, where, by convention, $\alpha \in [0.01, 0.05]$. Centred on a suitable unbiased 
and consistent point estimator function $\hat{\theta}_n(X_1, \ldots, X_n)$, the aim of the estimation process is to 
explicitly account for the sampling error $\delta_K$ arising due to the random selection process. This 
approach yields a two-sided confidence interval 

\[
K_{1-\alpha}(\theta) = \left[\hat{\theta}_n - \delta_K,\ \hat{\theta}_n + \delta_K\right] \tag{11.1}
\]

such that $P\left(\theta \in K_{1-\alpha}(\theta)\right) = 1 - \alpha$ applies. In the following we will consider the two cases which 
result when choosing $\theta \in \{\mu, \sigma^2\}$. 
11.1.1 Confidence intervals for a population mean 

When $\theta = \mu$ and $\hat{\theta}_n = \bar{X}_n$ by Eq. (10.6), the two-sided confidence interval for a population 
mean $\mu$ at confidence level $1 - \alpha$ becomes 

\[
K_{1-\alpha}(\mu) = \left[\bar{X}_n - \delta_K,\ \bar{X}_n + \delta_K\right] , \tag{11.2}
\]

with a sampling error amounting to 

\[
\delta_K = t_{n-1;1-\alpha/2}\, \frac{S_n}{\sqrt{n}} , \tag{11.3}
\]

where $S_n$ is the positive square root of the sample variance according to Eq. (10.7), and 
$t_{n-1;1-\alpha/2}$ denotes the value of the $(1 - \alpha/2)$-quantile of a t-distribution with $df = n - 1$ degrees 
of freedom; cf. Sec. 8.7. The ratio $S_n/\sqrt{n}$ is the standard error associated with $\bar{X}_n$. 

GDC: mode STAT → TESTS → TInterval 

Equation (11.3) may be inverted to obtain the minimum sample size necessary to construct a 
two-sided confidence interval for $\mu$ to a prescribed accuracy $\delta_{\max}$, maximal sample variance $\sigma_{\max}^2$, and 
at fixed confidence level $1 - \alpha$. Thus, 

\[
n \geq \left(\frac{t_{n-1;1-\alpha/2}}{\delta_{\max}}\right)^2 \sigma_{\max}^2 . \tag{11.4}
\]
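The confidence interval (11.2) with sampling error (11.3) can be evaluated as in the following Python sketch. Since the Python standard library offers no quantile function for the t-distribution, the sketch obtains the quantile numerically (Simpson integration of the t-density combined with bisection); this numerical route, the helper names and the data are assumptions made for illustration, and dedicated statistical software will usually be preferable.

```python
import math


def t_pdf(x, df):
    """Probability density function of a t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """cdf via Simpson's rule on [0, |x|], using the symmetry of the density."""
    a = abs(x)
    h = a / steps
    s = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = s * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p, df):
    """p-quantile by bisection; assumes the quantile lies within [-50, 50]."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical random sample of size n = 10.
data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9, 5.2, 4.8]
n = len(data)
alpha = 0.05
mean = sum(data) / n
s_n = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

delta_k = t_quantile(1 - alpha / 2, n - 1) * s_n / math.sqrt(n)   # Eq. (11.3)
lower, upper = mean - delta_k, mean + delta_k                     # Eq. (11.2)
print(lower, upper)
```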



11.1.2 Confidence intervals for a population variance 

When $\theta = \sigma^2$, and $\hat{\theta}_n = S_n^2$ by Eq. (10.7), the associated point estimator function 

\[
\frac{(n-1) S_n^2}{\sigma^2} , \quad \text{with } n \in \mathbb{N} , \tag{11.5}
\]

satisfies a $\chi^2$-distribution with $df = n - 1$ degrees of freedom; cf. Sec. 8.6. By inverting the 
condition 

\[
P\left(\chi^2_{n-1;\alpha/2} \leq \frac{(n-1) S_n^2}{\sigma^2} \leq \chi^2_{n-1;1-\alpha/2}\right) = 1 - \alpha , \tag{11.6}
\]

one derives a two-sided confidence interval for a population variance $\sigma^2$ at confidence level 
$1 - \alpha$ given by 

\[
\left[\frac{(n-1) S_n^2}{\chi^2_{n-1;1-\alpha/2}},\ \frac{(n-1) S_n^2}{\chi^2_{n-1;\alpha/2}}\right] . \tag{11.7}
\]

$\chi^2_{n-1;\alpha/2}$ and $\chi^2_{n-1;1-\alpha/2}$ again denote the values of particular quantiles of a $\chi^2$-distribution. 
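The interval (11.7) may be computed as in the Python sketch below. As the standard library lacks $\chi^2$-quantiles, the sketch evaluates the $\chi^2$-cdf through the power series of the regularised incomplete gamma function and inverts it by bisection; this numerical approach, the helper names and the data are illustrative assumptions.

```python
import math


def chi2_cdf(x, df):
    """cdf of a chi-square distribution: regularised lower incomplete gamma P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:        # power series in z
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_quantile(p, df):
    """p-quantile by bisection on an automatically enlarged bracket."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical random sample of size n = 10.
data = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9, 5.2, 4.8]
n = len(data)
alpha = 0.05
mean = sum(data) / n
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)

lower = (n - 1) * s2 / chi2_quantile(1 - alpha / 2, n - 1)   # Eq. (11.7), left boundary
upper = (n - 1) * s2 / chi2_quantile(alpha / 2, n - 1)       # Eq. (11.7), right boundary
print(lower, upper)
```

Note that, in contrast to the interval for the mean, this interval is not symmetric about the point estimate $S_n^2$.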




11.2 One-sample $\chi^2$-goodness-of-fit-test 

A standard research question in quantitative-empirical investigations deals with the issue whether 
or not, with respect to some population $\Omega$ of sample units, the distribution law of a specific 
random variable $X$ may be assumed to comply with a particular theoretical reference distribution. 
This question can be formulated in terms of the corresponding cdfs, $F_X(x)$ and $F_0(x)$, 
presupposing that for practical reasons the spectrum of values of $X$ is subdivided into a set of $k$ mutually 
exclusive categories (or bins), with $k$ a judiciously chosen integer which depends in the first place 
on the size $n$ of the random sample $S_\Omega$: $(X_1, \ldots, X_n)$ to be investigated. 

The non-parametric one-sample $\chi^2$-goodness-of-fit-test takes as its starting point the pair of 

Hypotheses: 

\[
\begin{cases}
H_0 : F_X(x) = F_0(x) \quad \Leftrightarrow \quad O_i - E_i = 0 \\
H_1 : F_X(x) \neq F_0(x) \quad \Leftrightarrow \quad O_i - E_i \neq 0
\end{cases} \tag{11.8}
\]

where $O_i$ $(i = 1, \ldots, k)$ denotes the actually observed frequency of category $i$ in a random sample 
of size $n$, $E_i := n p_i$ denotes the frequency of category $i$ theoretically expected under $H_0$ (and so 
$F_0(x)$) in the same random sample, and $p_i$ is the probability of finding a value of $X$ in category 
$i$ under $F_0(x)$. 

The present procedure, devised by Pearson (1900) [40], employs the residuals $O_i - E_i$ 
$(i = 1, \ldots, k)$ to construct a suitable 

Test statistic: 

\[
T_n(X_1, \ldots, X_n) = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \overset{H_0}{\approx} \chi^2(k - 1 - r) \tag{11.9}
\]

in terms of a sum of rescaled squared residuals $\frac{(O_i - E_i)^2}{E_i}$, which, under $H_0$, approximately 
satisfies a $\chi^2$-distribution with $df = k - 1 - r$ degrees of freedom (cf. Sec. 8.6), where $r$ is the number 
of unknown parameters of the reference distribution $F_0(x)$ which need to be estimated from the 
random sample data. For this test procedure to be reliable it is important (!) that the size $n$ of the 
random sample be chosen such that the condition 

\[
E_i \geq 5 \tag{11.10}
\]

holds for all categories $i = 1, \ldots, k$, due to the fact that the $E_i$ appear in the denominator of the 
test statistic in Eq. (11.9). 

Test decision: The rejection region for $H_0$ at significance level $\alpha$ is given by (right-sided test) 

\[
t_n > \chi^2_{k-1-r;1-\alpha} . \tag{11.11}
\]

By Eq. (10.13), the p-value associated with a realisation $t_n$ of (11.9) amounts to 

\[
p = P(T_n > t_n | H_0) = 1 - P(T_n \leq t_n | H_0) = 1 - \chi^2\mathrm{cdf}(0, t_n, k - 1 - r) . \tag{11.12}
\]

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → Chi-square . . . 

Note that in the spirit of critical rationalism the one-sample $\chi^2$-goodness-of-fit-test provides a 
tool for empirically excluding possibilities of distribution laws for $X$. 
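A worked example of Eqs. (11.9) to (11.12): the following Python sketch tests whether hypothetical counts from $n = 60$ throws of a die are compatible with a discrete uniform reference distribution ($k = 6$, $r = 0$). The data, the names and the series evaluation of the $\chi^2$-cdf are assumptions made for illustration.

```python
import math


def chi2_cdf(x, df):
    """cdf of a chi-square distribution via the series of the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


observed = [8, 12, 9, 11, 10, 10]          # hypothetical frequencies O_i
n = sum(observed)
probabilities = [1 / 6] * 6                # p_i under the reference distribution F_0
expected = [n * p for p in probabilities]  # E_i = n p_i

assert all(e >= 5 for e in expected)       # reliability condition, Eq. (11.10)

t_n = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # Eq. (11.9)
df = len(observed) - 1 - 0                                         # k - 1 - r with r = 0
p = 1 - chi2_cdf(t_n, df)                                          # Eq. (11.12)
print(t_n, df, p)
```

Here the p-value is far above any conventional significance level, so $H_0$ is not rejected for these data.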
11.3 One-sample t- and Z-tests for a population mean 

The idea here is to test whether the unknown population mean $\mu$ of some continuous random 
variable $X$ is equal to, less than, or greater than some reference value $\mu_0$, to a given significance 
level $\alpha$. To this end, it is required that $X$ satisfy in the population $\Omega$ a Gaußian normal 
distribution, i.e., $X \sim N(\mu; \sigma^2)$; cf. Sec. 8.5. The quantitative-analytical tool to be employed in 
this case is the parametric one-sample t-test for a population mean developed by Student 
[Gosset] (1908) [52], or, when the sample size $n \geq 50$, in consequence of the central limit theorem 
discussed in Sec. 8.13, the corresponding one-sample Z-test. 

For an independent random sample $S_\Omega$: $(X_1, \ldots, X_n)$ of size $n$, normality of the 
$X$-distribution can be tested for by a procedure due to the Russian mathematicians 
Andrey Nikolaevich Kolmogorov (1903-1987) and Nikolai Vasilyevich Smirnov (1900-1966), 
which tests the null hypothesis $H_0$: "There is no difference between the distribution of the sample 
data and a normal distribution" against the alternative $H_1$: "There is a difference between the 
distribution of the sample data and a normal distribution"; cf. Kolmogorov (1933) and Smirnov 
(1939) [48]. This procedure is referred to as the Kolmogorov-Smirnov-test (or, for short, the 
KS-test). The associated test statistic evaluates the strength of the deviation of the empirical 
cumulative distribution function [cf. Eq. (2.4)] of given random sample data with sample mean $\bar{x}_n$ 
and sample variance $s_n^2$ from the cdf of a reference Gaußian normal distribution with parameters 
$\mu$ and $\sigma^2$ equal to these sample values [cf. Eq. (8.40)]. 

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 1-Sample K-S . . . : Normal 
R: ks.test(variable, "pnorm", mean(variable), sd(variable)) 

Formulated in a non-directed or a directed fashion, the starting point of the t-test resp. Z-test 
procedures are the 

Hypotheses: 

\[
\begin{cases}
H_0 : \mu = \mu_0 \quad \text{or} \quad \mu \geq \mu_0 \quad \text{or} \quad \mu \leq \mu_0 \\
H_1 : \mu \neq \mu_0 \quad \text{or} \quad \mu < \mu_0 \quad \text{or} \quad \mu > \mu_0
\end{cases} \tag{11.13}
\]

To measure the deviation of the sample data from the state conjectured to hold in the null 
hypothesis $H_0$, the difference between the sample mean $\bar{X}_n$ and the hypothesised population mean $\mu_0$, 
normalised in analogy to Eq. (7.32) by the standard error 

\[
\frac{S_n}{\sqrt{n}} \tag{11.14}
\]

of $\bar{X}_n$ given in Eq. (10.8), serves as the $\mu_0$-dependent 

Test statistic: 

\[
T_n(X_1, \ldots, X_n) := \frac{\bar{X}_n - \mu_0}{S_n/\sqrt{n}} \overset{H_0}{\sim}
\begin{cases}
t(n-1) & \text{for } n < 50 \\
N(0; 1) & \text{for } n \geq 50
\end{cases} \tag{11.15}
\]

which, under $H_0$, satisfies a t-distribution with $df = n - 1$ degrees of freedom (cf. Sec. 8.7) resp. a 
standard normal distribution (cf. Sec. 8.5). 
Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at 
significance level $\alpha$ is given by 

Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ 
--- | --- | --- | --- 
(a) two-sided | $\mu = \mu_0$ | $\mu \neq \mu_0$ | $\lvert t_n \rvert > t_{n-1;1-\alpha/2}$ (t-test) resp. $\lvert t_n \rvert > z_{1-\alpha/2}$ (Z-test) 
(b) left-sided | $\mu \geq \mu_0$ | $\mu < \mu_0$ | $t_n < t_{n-1;\alpha} = -t_{n-1;1-\alpha}$ (t-test) resp. $t_n < z_{\alpha} = -z_{1-\alpha}$ (Z-test) 
(c) right-sided | $\mu \leq \mu_0$ | $\mu > \mu_0$ | $t_n > t_{n-1;1-\alpha}$ (t-test) resp. $t_n > z_{1-\alpha}$ (Z-test) 

p-values associated with realisations $t_n$ of (11.15) can be obtained from Eqs. (10.11) - (10.13). 

GDC: mode STAT → TESTS → T-Test . . . when $n < 50$, resp. mode STAT → TESTS → 
Z-Test . . . when $n \geq 50$. 

SPSS: Analyze → Compare Means → One-Sample T Test . . . 
R: t.test(variable, mu=μ0), 
t.test(variable, mu=μ0, alternative="less"), 
t.test(variable, mu=μ0, alternative="greater") 

Note: Regrettably, SPSS provides no option to select between a "one-tailed" (left-/right-sided) and 
a "two-tailed" (two-sided) t-test. The default setting is for a two-sided test. For the purpose of 
one-sided tests the p-value output of SPSS needs to be divided by 2. 
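For large samples ($n \geq 50$), the test statistic (11.15) and its p-values can be computed with the Python standard library alone, as in the following sketch of the Z-test branch. The function name and the artificial data are hypothetical; for $n < 50$ the quantiles and cdf of a t-distribution with $n - 1$ degrees of freedom would be needed instead.

```python
import math
from statistics import NormalDist


def one_sample_z_test(data, mu0):
    """Realisation t_n of Eq. (11.15) and its two-sided p-value (requires n >= 50)."""
    n = len(data)
    if n < 50:
        raise ValueError("normal approximation requires n >= 50; use the t-test instead")
    mean = sum(data) / n
    s_n = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    t_n = (mean - mu0) / (s_n / math.sqrt(n))
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(t_n)))   # Eq. (10.11), symmetric case
    return t_n, p_two_sided


# Artificial sample of size n = 50 with sample mean 2.0.
data = [float(i % 5) for i in range(50)]
t_n, p = one_sample_z_test(data, mu0=1.5)
print(t_n, p)
```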



11.4 One-sample $\chi^2$-test for a population variance 

In analogy to the statistical significance test described in the previous Sec. 11.3, one may 
likewise test hypotheses on the value of an unknown population variance $\sigma^2$ with respect to a reference 
value $\sigma_0^2$ for a continuous random variable $X$ which satisfies in $\Omega$ a Gaußian normal 
distribution, i.e., $X \sim N(\mu; \sigma^2)$; cf. Sec. 8.5. These may also be formulated in a non-directed or directed 
fashion according to 

Hypotheses: 

\[
\begin{cases}
H_0 : \sigma^2 = \sigma_0^2 \quad \text{or} \quad \sigma^2 \geq \sigma_0^2 \quad \text{or} \quad \sigma^2 \leq \sigma_0^2 \\
H_1 : \sigma^2 \neq \sigma_0^2 \quad \text{or} \quad \sigma^2 < \sigma_0^2 \quad \text{or} \quad \sigma^2 > \sigma_0^2
\end{cases} \tag{11.16}
\]

In the one-sample $\chi^2$-test for a population variance, the underlying $\sigma_0^2$-dependent 
Test statistic: 

\[
T_n(X_1, \ldots, X_n) := \frac{(n-1) S_n^2}{\sigma_0^2} \overset{H_0}{\sim} \chi^2(n-1) \tag{11.17}
\]

is chosen to be proportional to the sample variance defined by Eq. (10.7), and so, under $H_0$, satisfies 
a $\chi^2$-distribution with $df = n - 1$ degrees of freedom; cf. Sec. 8.6. 

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at 
significance level $\alpha$ is given by 

Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ 
--- | --- | --- | --- 
(a) two-sided | $\sigma^2 = \sigma_0^2$ | $\sigma^2 \neq \sigma_0^2$ | $t_n < \chi^2_{n-1;\alpha/2}$ or $t_n > \chi^2_{n-1;1-\alpha/2}$ 
(b) left-sided | $\sigma^2 \geq \sigma_0^2$ | $\sigma^2 < \sigma_0^2$ | $t_n < \chi^2_{n-1;\alpha}$ 
(c) right-sided | $\sigma^2 \leq \sigma_0^2$ | $\sigma^2 > \sigma_0^2$ | $t_n > \chi^2_{n-1;1-\alpha}$ 

p-values associated with realisations $t_n$ of (11.17) can be obtained from Eqs. (10.11) - (10.13). 

Regrettably, the one-sample $\chi^2$-test for a population variance does not appear to have been 
implemented in the SPSS software package. 
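Since the test is easily carried out by hand, a minimal Python sketch of the right-sided variant may be useful. The data, the reference value $\sigma_0^2 = 4$ and the helper names are hypothetical; the $\chi^2$-cdf is again evaluated by a power series because the standard library provides none.

```python
import math


def chi2_cdf(x, df):
    """cdf of a chi-square distribution via the series of the incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, z = df / 2, x / 2
    term = 1.0 / a
    total = term
    k = 1
    while term > 1e-15 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def variance_test_statistic(data, sigma0_sq):
    """Realisation of the test statistic (11.17) and its degrees of freedom."""
    n = len(data)
    mean = sum(data) / n
    s2 = sum((x - mean) ** 2 for x in data) / (n - 1)
    return (n - 1) * s2 / sigma0_sq, n - 1


data = [2.0, 4.0, 6.0, 8.0, 10.0]          # hypothetical sample, S_n^2 = 10
t_n, df = variance_test_statistic(data, sigma0_sq=4.0)
p_right = 1 - chi2_cdf(t_n, df)            # right-sided p-value, Eq. (10.13)
print(t_n, df, p_right)
```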



11.5 Two independent samples t-test for a population mean 

Quantitative-empirical studies are frequently interested in the question as to what extent there 
exist significant differences between two subgroups of some population $\Omega$ in the distribution of 
a metrically scaled variable $X$. Given that $X$ is normally distributed in $\Omega$ (cf. Sec. 8.5), the 
parametric two independent samples t-test for a population mean originating from work by 
Student [Gosset] (1908) [52] provides an efficient and powerful investigative tool. 

For independent random samples of sizes $n_1, n_2 \geq 50$, normality of the $X$-distribution can be 
tested for by the Kolmogorov-Smirnov-test; cf. Sec. 11.3. 

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 1-Sample K-S . . . : Normal 
R: ks.test(variable, "pnorm", mean(variable), sd(variable)) 

In addition, prior to the t-test procedure, one needs to establish whether or not the variances of $X$ 
are significantly different in the two random samples selected. Levene's test provides an empirical 
method to test $H_0$: $\sigma_1^2 = \sigma_2^2$ against $H_1$: $\sigma_1^2 \neq \sigma_2^2$; cf. Levene (1960) [29]. 

R: leveneTest(variable ~ group variable) (package: car) 



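The hypotheses of a t-test may be formulated in a non-directed fashion or in a directed one. Hence, 
the different kinds of conjectures are 

Hypotheses: (test for differences) 

\[
\begin{cases}
H_0 : \mu_1 - \mu_2 = 0 \quad \text{or} \quad \mu_1 - \mu_2 \geq 0 \quad \text{or} \quad \mu_1 - \mu_2 \leq 0 \\
H_1 : \mu_1 - \mu_2 \neq 0 \quad \text{or} \quad \mu_1 - \mu_2 < 0 \quad \text{or} \quad \mu_1 - \mu_2 > 0
\end{cases} \tag{11.18}
\]

A test statistic is constructed from the difference of sample means, $\bar{X}_{n_1} - \bar{X}_{n_2}$, standardised by the 
standard error 

\[
S_{\bar{X}_{n_1} - \bar{X}_{n_2}} := \sqrt{\frac{S_{n_1}^2}{n_1} + \frac{S_{n_2}^2}{n_2}} \tag{11.19}
\]

which derives from the associated theoretical sampling distribution. Thus, one obtains the 

Test statistic: 

\[
T_{n_1,n_2} := \frac{\bar{X}_{n_1} - \bar{X}_{n_2}}{S_{\bar{X}_{n_1} - \bar{X}_{n_2}}} \overset{H_0}{\sim} t(df) \tag{11.20}
\]

which, under $H_0$, is t-distributed (cf. Sec. 8.7) with a number of degrees of freedom determined 
by the relations 

\[
df :=
\begin{cases}
n_1 + n_2 - 2 , & \text{when } \sigma_1^2 = \sigma_2^2 \\[1ex]
\dfrac{\left(\dfrac{S_{n_1}^2}{n_1} + \dfrac{S_{n_2}^2}{n_2}\right)^2}{\dfrac{(S_{n_1}^2/n_1)^2}{n_1 - 1} + \dfrac{(S_{n_2}^2/n_2)^2}{n_2 - 1}} , & \text{when } \sigma_1^2 \neq \sigma_2^2
\end{cases} \tag{11.21}
\]

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at 
significance level $\alpha$ is given by 

Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ 
--- | --- | --- | --- 
(a) two-sided | $\mu_1 - \mu_2 = 0$ | $\mu_1 - \mu_2 \neq 0$ | $\lvert t_{n_1,n_2} \rvert > t_{df;1-\alpha/2}$ 
(b) left-sided | $\mu_1 - \mu_2 \geq 0$ | $\mu_1 - \mu_2 < 0$ | $t_{n_1,n_2} < t_{df;\alpha} = -t_{df;1-\alpha}$ 
(c) right-sided | $\mu_1 - \mu_2 \leq 0$ | $\mu_1 - \mu_2 > 0$ | $t_{n_1,n_2} > t_{df;1-\alpha}$ 

p-values associated with realisations $t_{n_1,n_2}$ of (11.20) can be obtained from Eqs. (10.11) - (10.13). 

GDC: mode STAT → TESTS → 2-SampTTest . . . 

SPSS: Analyze → Compare Means → Independent-Samples T Test . . . 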
R: t.test(variable ~ group variable), 
t.test(variable ~ group variable, alternative="less"), 
t.test(variable ~ group variable, alternative="greater") 

Note: Regrettably, SPSS provides no option to select between a one-sided and a two-sided t-test. 
The default setting is for a two-sided test. For the purpose of one-sided tests the p-value output of 
SPSS needs to be divided by 2. 
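The quantities entering Eqs. (11.19) to (11.21) are simple to compute from two samples. The Python sketch below returns the realisation of the test statistic together with both variants of the number of degrees of freedom; the function name and the data are hypothetical.

```python
import math
from statistics import mean, variance


def two_sample_t(sample1, sample2):
    """Test statistic (11.20) and the two df variants of Eq. (11.21)."""
    n1, n2 = len(sample1), len(sample2)
    v1 = variance(sample1) / n1            # S_{n1}^2 / n1
    v2 = variance(sample2) / n2            # S_{n2}^2 / n2
    standard_error = math.sqrt(v1 + v2)    # Eq. (11.19)
    t = (mean(sample1) - mean(sample2)) / standard_error
    df_equal = n1 + n2 - 2                 # when the population variances are equal
    df_unequal = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df_equal, df_unequal


# Hypothetical data for two independent subgroups.
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df_equal, df_unequal = two_sample_t(group1, group2)
print(t, df_equal, df_unequal)
```

In the unequal-variances case the number of degrees of freedom is in general not an integer.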

When the necessary conditions for the application of the independent samples t-test are not 
satisfied, the following alternative test procedures (typically of a weaker test power, though) for 
comparing two subgroups of $\Omega$ with respect to the distribution of a metrically scaled variable $X$ exist: 

(i) at the nominal scale level, provided $E_{ij} > 5$ for all $i, j$, the $\chi^2$-test for homogeneity; cf. 
the corresponding section below, and 

(ii) at the ordinal scale level, provided $n_1, n_2 \geq 8$, the two independent samples 
Mann-Whitney-U-test for a median; cf. the following Sec. 11.6. 

11.6 Two independent samples Mann-Whitney-U-test for a population median 

The non-parametric two independent samples Mann-Whitney-U-test for a population 
median, devised by the Austrian-US-American mathematician and statistician 
Henry Berthold Mann (1905-2000) and the US-American statistician Donald Ransom Whitney 
(1915-2001) in 1947 [36], can be applied to random sample data for ordinally scaled variables 
$X$, or for metrically scaled variables $X$ which are not normally distributed in the population $\Omega$. 
In both situations, the method employs rank data (cf. Sec. 4.3), which faithfully represents the 
original random sample data, to compare the medians of $X$ between two independent groups. It 
aims to test empirically one of the following pairs of non-directed or directed 

Hypotheses: (test for differences) 

\[
\begin{cases}
H_0 : \tilde{x}_{0.5}(1) = \tilde{x}_{0.5}(2) \quad \text{or} \quad \tilde{x}_{0.5}(1) \geq \tilde{x}_{0.5}(2) \quad \text{or} \quad \tilde{x}_{0.5}(1) \leq \tilde{x}_{0.5}(2) \\
H_1 : \tilde{x}_{0.5}(1) \neq \tilde{x}_{0.5}(2) \quad \text{or} \quad \tilde{x}_{0.5}(1) < \tilde{x}_{0.5}(2) \quad \text{or} \quad \tilde{x}_{0.5}(1) > \tilde{x}_{0.5}(2)
\end{cases} \tag{11.22}
\]

Given two independent sets of random sample data for $X$, ranks are introduced on the basis 
of an ordered joint random sample of size $n = n_1 + n_2$ according to $x_i(1) \mapsto R[x_i(1)]$ and 
$x_i(2) \mapsto R[x_i(2)]$. From the ranks thus assigned to each of the two sets of data, one computes the 

U-values: 

\[
U_1 := n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - \sum_{i=1}^{n_1} R[x_i(1)] \tag{11.23}
\]

\[
U_2 := n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - \sum_{i=1}^{n_2} R[x_i(2)] \tag{11.24}
\]
which are subject to the identity $U_1 + U_2 = n_1 n_2$. Choose $U := \min(U_1, U_2)$.¹ For independent 
random samples of sizes $n_1, n_2 \geq 8$ (see, e.g., Bortz (2005) [4, p 151]), the standardised U-value 
serves as the 

Test statistic: 

\[
T_{n_1,n_2} := \frac{U - \mu_U}{\sigma_U} \overset{H_0}{\approx} N(0; 1) \tag{11.25}
\]

which, under $H_0$, approximately satisfies a standard normal distribution; cf. Sec. 8.5. Here, $\mu_U$ 
denotes the mean of the U-value expected under $H_0$; it is defined in terms of the sample sizes by 

\[
\mu_U := \frac{n_1 n_2}{2} , \tag{11.26}
\]

while $\sigma_U$ is the standard error of the U-value and can be obtained, e.g., from Bortz (2005) [4, 
Eq. (5.49)]. 

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at 
significance level $\alpha$ is given by 

Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ 
--- | --- | --- | --- 
(a) two-sided | $\tilde{x}_{0.5}(1) = \tilde{x}_{0.5}(2)$ | $\tilde{x}_{0.5}(1) \neq \tilde{x}_{0.5}(2)$ | $\lvert t_{n_1,n_2} \rvert > z_{1-\alpha/2}$ 
(b) left-sided | $\tilde{x}_{0.5}(1) \geq \tilde{x}_{0.5}(2)$ | $\tilde{x}_{0.5}(1) < \tilde{x}_{0.5}(2)$ | $t_{n_1,n_2} < z_{\alpha} = -z_{1-\alpha}$ 
(c) right-sided | $\tilde{x}_{0.5}(1) \leq \tilde{x}_{0.5}(2)$ | $\tilde{x}_{0.5}(1) > \tilde{x}_{0.5}(2)$ | $t_{n_1,n_2} > z_{1-\alpha}$ 

p-values associated with realisations $t_{n_1,n_2}$ of (11.25) can be obtained from Eqs. (10.11) - (10.13). 

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 2 Independent Samples . . . : 
Mann-Whitney U 

R: wilcox.test(variable ~ group variable), 
wilcox.test(variable ~ group variable, alternative="less"), 
wilcox.test(variable ~ group variable, alternative="greater") 

Note: Regrettably, SPSS provides no option to select between a one-sided and a two-sided U-test. 
The default setting is for a two-sided test. For the purpose of one-sided tests the p-value output of 
SPSS needs to be divided by 2. 

1 Since the U-values are tied to each other by the identity $U_1 + U_2 = n_1 n_2$, it makes no difference to this method 
when one chooses $U := \max(U_1, U_2)$ instead. 
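The rank-based computation can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of these notes: the U-values are computed in the common form $U_j = n_1 n_2 + n_j(n_j + 1)/2 - \sum_i R[x_i(j)]$, and the standard error is taken to be the usual large-sample expression without correction for ties, $\sigma_U = \sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}$. Tied values receive the mean of the ranks they occupy; all names and data are hypothetical.

```python
import math
from statistics import NormalDist


def mann_whitney_u(sample1, sample2):
    """U-values, standardised test statistic (11.25) and two-sided p-value."""
    n1, n2 = len(sample1), len(sample2)
    values = sorted(list(sample1) + list(sample2))   # ordered joint sample
    rank_of = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        rank_of[values[i]] = (i + 1 + j) / 2         # mean of the ranks i+1, ..., j
        i = j
    r1 = sum(rank_of[x] for x in sample1)
    r2 = sum(rank_of[x] for x in sample2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    u = min(u1, u2)
    mu_u = n1 * n2 / 2                                # Eq. (11.26)
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12) # assumption: no tie correction
    t = (u - mu_u) / sigma_u
    p_two_sided = 2 * NormalDist().cdf(-abs(t))
    return u1, u2, t, p_two_sided


# Hypothetical samples of sizes n1 = n2 = 8 without overlap.
group1 = [1, 2, 3, 4, 5, 6, 7, 8]
group2 = [9, 10, 11, 12, 13, 14, 15, 16]
u1, u2, t, p = mann_whitney_u(group1, group2)
print(u1, u2, t, p)
```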
11.7 Two independent samples F-test for a population variance 

In analogy to the independent samples t-test for a population mean of Sec. 11.5, one may likewise 
investigate for a metrically scaled variable $X$ which satisfies a Gaußian normal distribution in $\Omega$ 
(cf. Sec. 8.5) whether there exists a significant difference in the value of the population variance 
between two independent random samples.² The parametric two independent samples F-test 
for a population variance evaluates the non-directed resp. directed pairs of 

Hypotheses: (test for differences) 

\[
\begin{cases}
H_0 : \sigma_1^2 = \sigma_2^2 \quad \text{or} \quad \sigma_1^2 \geq \sigma_2^2 \quad \text{or} \quad \sigma_1^2 \leq \sigma_2^2 \\
H_1 : \sigma_1^2 \neq \sigma_2^2 \quad \text{or} \quad \sigma_1^2 < \sigma_2^2 \quad \text{or} \quad \sigma_1^2 > \sigma_2^2
\end{cases} \tag{11.27}
\]

Dealing with independent random samples of sizes $n_1$ and $n_2$, the ratio of the corresponding sample 
variances serves as a 

Test statistic: 

\[
T_{n_1,n_2} := \frac{S_{n_1}^2}{S_{n_2}^2} \overset{H_0}{\sim} F(n_1 - 1, n_2 - 1) \tag{11.28}
\]

which, under $H_0$, satisfies an F-distribution with $df_1 = n_1 - 1$ and $df_2 = n_2 - 1$ degrees of 
freedom; cf. Sec. 8.8. 

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at 
significance level $\alpha$ is given by 

Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ 
--- | --- | --- | --- 
(a) two-sided | $\sigma_1^2 = \sigma_2^2$ | $\sigma_1^2 \neq \sigma_2^2$ | $t_{n_1,n_2} < 1/f_{n_2-1,n_1-1;1-\alpha/2}$ or $t_{n_1,n_2} > f_{n_1-1,n_2-1;1-\alpha/2}$ 
(b) left-sided | $\sigma_1^2 \geq \sigma_2^2$ | $\sigma_1^2 < \sigma_2^2$ | $t_{n_1,n_2} < 1/f_{n_2-1,n_1-1;1-\alpha}$ 
(c) right-sided | $\sigma_1^2 \leq \sigma_2^2$ | $\sigma_1^2 > \sigma_2^2$ | $t_{n_1,n_2} > f_{n_1-1,n_2-1;1-\alpha}$ 

p-values associated with realisations $t_{n_1,n_2}$ of (11.28) can be obtained from Eqs. (10.11) - (10.13). 

GDC: mode STAT -> TESTS -)■ 2-SampFTest . . . 
R: var . test {variable- group variable) , 



2 Run the Kolmogorov-Smirnov-test to check for normality of the distribution of X in the two random samples. 
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var.test {variable- group variable , alternat ive=" less " ) , 
var.test {variable- group variable , alternat ive="greater" ) 

Regrettably, the two-sample F-test for a population variance does not appear to have been implemented in the SPSS software package. Instead, to address quantitative issues of the kind raised here, one may resort to Levene's test; cf. Sec. 11.5.
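As a minimal illustration of Eq. (11.28), the following Python sketch computes a realisation of the test statistic as the ratio of the two sample variances. The data values and the helper name `f_statistic` are hypothetical and only serve to make the arithmetic concrete; critical values or p-values would still have to be taken from an F-distribution with the stated degrees of freedom.

```python
from statistics import variance  # sample variance with (n - 1) in the denominator


def f_statistic(sample1, sample2):
    """Ratio of the two sample variances, cf. Eq. (11.28)."""
    return variance(sample1) / variance(sample2)


# Hypothetical data for two independent random samples
sample1 = [4.0, 6.0, 8.0, 10.0, 12.0]
sample2 = [5.0, 6.0, 7.0, 8.0, 9.0]

t = f_statistic(sample1, sample2)
df1, df2 = len(sample1) - 1, len(sample2) - 1
print(t, df1, df2)
```

For these made-up numbers the sample variances are 10 and 2.5, so the ratio is 4 with $df_1 = df_2 = 4$.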



11.8 Two dependent samples t-test for a population mean 

Besides investigating for significant differences in the distribution of a single variable $X$ in two or more independent subgroups of some population $\Omega$, many research projects are interested in finding out (i) how the distributional properties of a variable $X$ have changed within one and the same random sample of $\Omega$ in an experimental before-after situation, or (ii) how the distribution of a variable $X$ differs between two subgroups of $\Omega$ whose sample units co-exist in a natural pairwise one-to-one correspondence to one another.

When the variable $X$ in question is metrically scaled and satisfies a Gaußian normal distribution in $\Omega$, significant differences can be tested for by means of the parametric two dependent samples t-test for a population mean. Denoting by $A$ and $B$ either some before and after instants, or the partners in a set of natural pairs $(A, B)$, define for $X$ the metrically scaled difference variable

$$
D := X(A) - X(B) .
\tag{11.29}
$$



An important test prerequisite demands that $D$ itself be normally distributed in $\Omega$; cf. Sec. 8.5. Whether this property holds true can be checked via the Kolmogorov–Smirnov-test; cf. Sec. 11.3.

With $\mu_D$ denoting the population mean of the difference variable $D$, the

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & \mu_D = 0 \quad\text{or}\quad \mu_D \geq 0 \quad\text{or}\quad \mu_D \leq 0 \\
H_1: & \mu_D \neq 0 \quad\text{or}\quad \mu_D < 0 \quad\text{or}\quad \mu_D > 0
\end{array}
\tag{11.30}
$$

can be given in a non-directed or a directed formulation. From the sample mean $\bar{D}$ and its associated standard error,

$$
\frac{S_D}{\sqrt{n}} ,
\tag{11.31}
$$

where $S_D$ is the sample standard deviation of $D$, one obtains by means of standardisation according to Eq. (7.32) the

Test statistic:

$$
T_n := \frac{\bar{D}}{S_D/\sqrt{n}} \overset{H_0}{\sim} t(n - 1)
\tag{11.32}
$$

which, under $H_0$, satisfies a t-distribution with $df = n - 1$ degrees of freedom; cf. Sec. 8.7.

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at significance level $\alpha$ is given by

| Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ |
|---|---|---|---|
| (a) two-sided | $\mu_D = 0$ | $\mu_D \neq 0$ | $\lvert t_n \rvert > t_{n-1;1-\alpha/2}$ |
| (b) left-sided | $\mu_D \geq 0$ | $\mu_D < 0$ | $t_n < t_{n-1;\alpha} = -t_{n-1;1-\alpha}$ |
| (c) right-sided | $\mu_D \leq 0$ | $\mu_D > 0$ | $t_n > t_{n-1;1-\alpha}$ |



p-values associated with realisations $t_n$ of (11.32) can be obtained from Eqs. (10.11)–(10.13).

SPSS: Analyze → Compare Means → Paired-Samples T Test...
R: t.test(variableA, variableB, paired = TRUE),
t.test(variableA, variableB, paired = TRUE, alternative = "less"),
t.test(variableA, variableB, paired = TRUE, alternative = "greater")

Note: Regrettably, SPSS provides no option to select between a one-sided and a two-sided t-test. 
The default setting is for a two-sided test. For the purpose of one-sided tests the p-value output of 
SPSS needs to be divided by 2. 
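The construction of the test statistic (11.32) from the difference variable $D$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the before/after values and the function name `paired_t_statistic` are hypothetical, and the sketch stops at the test statistic and its degrees of freedom, leaving the lookup of critical values or p-values to a t-distribution table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev  # stdev uses (n - 1) in the denominator


def paired_t_statistic(a, b):
    """Return (t_n, df) for paired samples a and b, cf. Eq. (11.32)."""
    d = [x - y for x, y in zip(a, b)]      # realisations of D = X(A) - X(B)
    n = len(d)
    standard_error = stdev(d) / sqrt(n)     # S_D / sqrt(n)
    return mean(d) / standard_error, n - 1


# Hypothetical before/after measurements on the same four sample units
before = [12.0, 15.0, 11.0, 14.0]
after = [10.0, 14.0, 8.0, 12.0]

t_n, df = paired_t_statistic(before, after)
print(t_n, df)
```

Here the differences are 2, 1, 3, 2, so $\bar{D} = 2$ and $S_D = \sqrt{2/3}$, which gives $t_n = 2\sqrt{6} \approx 4.90$ with $df = 3$.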



11.9 Two dependent samples Wilcoxon-test for a population median

When the test prerequisites of the dependent samples t-test cannot be met, i.e., a given metrically scaled variable $X$ cannot be assumed to satisfy a Gaußian normal distribution in $\Omega$, or $X$ is an ordinally scaled variable in the first place, the non-parametric signed ranks test published by the US-American chemist and statistician Frank Wilcoxon (1892–1965) in 1945 [62] constitutes a quantitative-empirical tool for comparing the distributional properties of $X$ between two dependent random samples drawn from $\Omega$. Like Mann and Whitney's U-test discussed in Sec. 11.6, it is built around the idea of rank data representing the original random sample data; cf. Sec. 4.3. Defining again a variable

$$
D := X(A) - X(B) ,
\tag{11.33}
$$

with associated median $\tilde{x}_{0.5}(D)$, the non-directed or directed pairs of

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & \tilde{x}_{0.5}(D) = 0 \quad\text{or}\quad \tilde{x}_{0.5}(D) \geq 0 \quad\text{or}\quad \tilde{x}_{0.5}(D) \leq 0 \\
H_1: & \tilde{x}_{0.5}(D) \neq 0 \quad\text{or}\quad \tilde{x}_{0.5}(D) < 0 \quad\text{or}\quad \tilde{x}_{0.5}(D) > 0
\end{array}
\tag{11.34}
$$



need to be subjected to a suitable significance test.

For realisations $d_i$ ($i = 1, \ldots, n$) of $D$, introduce ranks according to $d_i \mapsto R[|d_i|]$ for the ordered absolute values $|d_i|$, while keeping a record of the sign of each $d_i$. Exclude from the data set all null differences $d_i = 0$, leading to a sample of reduced size $n \mapsto n_{\text{red}}$. Then form the sums of ranks $W^+$ for the $d_i > 0$ and $W^-$ for the $d_i < 0$, respectively, which are linked to one another by the identity $W^+ + W^- = n_{\text{red}}(n_{\text{red}} + 1)/2$. Choose $W^+$.³ For reduced sample sizes $n_{\text{red}} > 20$ (see, e.g., Rinne (2008) [45, p 552]), one employs the

Test statistic:

$$
T_{n_{\text{red}}} := \frac{W^+ - \mu_{W^+}}{\sigma_{W^+}} \overset{H_0}{\approx} N(0;1)
\tag{11.35}
$$

which, under $H_0$, approximately satisfies a standard normal distribution; cf. Sec. 8.5. Here, the mean $\mu_{W^+}$ expected under $H_0$ is defined in terms of $n_{\text{red}}$ by

$$
\mu_{W^+} := \frac{n_{\text{red}}(n_{\text{red}} + 1)}{4} ,
\tag{11.36}
$$

while the standard error $\sigma_{W^+}$ can be computed from, e.g., Bortz (2005) [4, Eq. (5.52)].

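Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at significance level $\alpha$ is given by

| Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ |
|---|---|---|---|
| (a) two-sided | $\tilde{x}_{0.5}(D) = 0$ | $\tilde{x}_{0.5}(D) \neq 0$ | $\lvert t_{n_{\text{red}}} \rvert > z_{1-\alpha/2}$ |
| (b) left-sided | $\tilde{x}_{0.5}(D) \geq 0$ | $\tilde{x}_{0.5}(D) < 0$ | $t_{n_{\text{red}}} < z_{\alpha} = -z_{1-\alpha}$ |
| (c) right-sided | $\tilde{x}_{0.5}(D) \leq 0$ | $\tilde{x}_{0.5}(D) > 0$ | $t_{n_{\text{red}}} > z_{1-\alpha}$ |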



p-values associated with realisations $t_{n_{\text{red}}}$ of (11.35) can be obtained from Eqs. (10.11)–(10.13).

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 2 Related Samples...: Wilcoxon
R: wilcox.test(variableA, variableB, paired = TRUE),
wilcox.test(variableA, variableB, paired = TRUE, alternative = "less"),
wilcox.test(variableA, variableB, paired = TRUE, alternative = "greater")

Note: Regrettably, SPSS provides no option to select between a one-sided and a two-sided 
Wilcoxon-test. The default setting is for a two-sided test. For the purpose of one-sided tests 
the p-value output of SPSS needs to be divided by 2. 
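The ranking procedure described above can be sketched in Python as follows. The sketch is hedged in two respects: the data and all function names are hypothetical, and the standard error uses the textbook no-ties expression $\sigma_{W^+} = \sqrt{n_{\text{red}}(n_{\text{red}}+1)(2n_{\text{red}}+1)/24}$ as an assumption, since these notes refer to Bortz (2005) for that quantity rather than stating it.

```python
from math import sqrt


def average_ranks(values):
    """Ranks 1..n of values; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1          # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def wilcoxon_w_plus(a, b):
    """Return (W+, n_red) after dropping null differences."""
    d = [x - y for x, y in zip(a, b) if x != y]
    ranks = average_ranks([abs(v) for v in d])
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    return w_plus, len(d)


def wilcoxon_z(w_plus, n_red):
    """Standardised W+; sigma is the assumed no-ties formula."""
    mu = n_red * (n_red + 1) / 4
    sigma = sqrt(n_red * (n_red + 1) * (2 * n_red + 1) / 24)
    return (w_plus - mu) / sigma


# Hypothetical paired data; one pair has a null difference and is excluded
a = [5, 7, 3, 9, 6]
b = [4, 7, 5, 6, 2]

w_plus, n_red = wilcoxon_w_plus(a, b)
z = wilcoxon_z(w_plus, n_red)
print(w_plus, n_red, z)
```

With these numbers the non-zero differences are 1, -2, 3, 4, so $W^+ = 8$ and $n_{\text{red}} = 4$. The sample is far too small for the normal approximation to be trusted ($n_{\text{red}} > 20$ is required); it is kept tiny only so that the ranks can be checked by hand.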



³ Due to the identity $W^+ + W^- = n_{\text{red}}(n_{\text{red}} + 1)/2$, choosing instead $W^-$ would make no qualitative difference to the subsequent test procedure.

11.10 $\chi^2$-test for homogeneity

Due to its independence of scale levels, the non-parametric $\chi^2$-test for homogeneity constitutes the most generally applicable statistical test for significant differences in the distributional properties of a particular variable $X$ between $k \in \mathbb{N}$ different subgroups of some population $\Omega$. By assumption, the variable $X$ may take values in a total of $l \in \mathbb{N}$ different categories $a_j$ ($j = 1, \ldots, l$). Begin by formulating the

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & X \text{ satisfies the same distribution in all } k \text{ subgroups of } \Omega \\
H_1: & X \text{ satisfies different distributions in at least two subgroups of } \Omega
\end{array}
\tag{11.37}
$$

With $O_{ij}$ denoting the observed frequency of category $a_j$ in subgroup $i$ ($i = 1, \ldots, k$), and $E_{ij}$ the under $H_0$ expected frequency of category $a_j$ in subgroup $i$, the sum of rescaled squared residuals $\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ provides a useful

Test statistic:

$$
T_n := \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \overset{H_0}{\approx} \chi^2[(k-1) \times (l-1)]
\tag{11.38}
$$




Under $H_0$, this test statistic approximately satisfies a $\chi^2$-distribution with $df = (k - 1) \times (l - 1)$ degrees of freedom; cf. Sec. 8.6. The $E_{ij}$ are defined in terms of marginal observed frequencies $O_{i+}$ (the sum of observed frequencies in row $i$; cf. Eq. (4.3)) and $O_{+j}$ (the sum of observed frequencies in column $j$; cf. Eq. (4.4)) by

$$
E_{ij} := \frac{O_{i+} O_{+j}}{n} .
\tag{11.39}
$$

Note the important (!) test prerequisite that the total sample size $n := n_1 + \ldots + n_k$ be such that

$$
E_{ij} \geq 5
\tag{11.40}
$$

applies for all categories $a_j$ and subgroups $i$.

Test decision: The rejection region for $H_0$ at significance level $\alpha$ is given by (right-sided test)

$$
t_n > \chi^2_{(k-1) \times (l-1);1-\alpha} .
\tag{11.41}
$$

By Eq. (10.13), the p-value associated with a realisation $t_n$ of (11.38) amounts to

$$
p = P(T_n > t_n | H_0) = 1 - P(T_n \leq t_n | H_0) = 1 - \chi^2\text{cdf}(0, t_n, (k-1) \times (l-1)) .
\tag{11.42}
$$
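To make Eqs. (11.38)–(11.40) concrete, the following Python sketch computes the expected frequencies from the marginal sums and then the test statistic. The contingency table and the function name `chi_square_homogeneity` are hypothetical; the sketch returns the statistic, its degrees of freedom and the table of expected frequencies, so that the prerequisite $E_{ij} \geq 5$ can be inspected before any test decision is made.

```python
def chi_square_homogeneity(observed):
    """observed[i][j]: frequency of category a_j in subgroup i.

    Returns (t_n, df, expected), cf. Eqs. (11.38) and (11.39).
    """
    k, l = len(observed), len(observed[0])
    row_sums = [sum(row) for row in observed]                                  # O_{i+}
    col_sums = [sum(observed[i][j] for i in range(k)) for j in range(l)]      # O_{+j}
    n = sum(row_sums)
    expected = [[row_sums[i] * col_sums[j] / n for j in range(l)] for i in range(k)]
    t_n = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(k)
        for j in range(l)
    )
    return t_n, (k - 1) * (l - 1), expected


# Hypothetical 2 x 2 table: two subgroups, two categories
observed = [[10, 20], [20, 10]]

t_n, df, expected = chi_square_homogeneity(observed)
prerequisite_ok = all(e >= 5 for row in expected for e in row)   # Eq. (11.40)
print(t_n, df, prerequisite_ok)
```

In this example every expected frequency equals 15, each of the four cells contributes $25/15$, and the statistic is $20/3 \approx 6.67$ with $df = 1$.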

GDC: mode STAT → TESTS → $\chi^2$-Test...
SPSS: Analyze → Descriptive Statistics → Crosstabs... → Statistics...: Chi-square
R: chisq.test(group variable, variable)

Typically the power of a $\chi^2$-test for homogeneity is weaker than for the related two procedures of comparing independent subgroups of $\Omega$ which will be discussed in the subsequent Secs. 11.11 and 11.12.

11.11 One-way analysis of variance (ANOVA)

This powerful quantitative-analytical tool has been developed in the context of investigations on biometrical genetics by the English statistician Sir Ronald Aylmer Fisher FRS (1890–1962) (see Fisher (1918) [14]), and later extended by the US-American statistician Henry Scheffé (1907–1977) (see Scheffé (1959) [46]). It is of a parametric nature and can be interpreted alternatively as a method for⁴

(i) investigating the influence of a qualitative variable $Y$ with $k \geq 3$ categories $a_i$ ($i = 1, \ldots, k$), generally referred to as a "factor", on a quantitative variable $X$, or

(ii) testing for differences of the mean of a quantitative variable $X$ between $k \geq 3$ different subgroups of some population $\Omega$.

A necessary condition for the application of the one-way analysis of variance (ANOVA) test procedure is that the quantitative variable $X$ to be investigated needs to be (a) normally distributed (cf. Sec. 8.5) in the $k \geq 3$ subgroups of the population $\Omega$ considered, with, in addition, (b) equal variances. Both of these conditions also have to hold for each of a set of $k$ mutually stochastically independent random variables $X_1, \ldots, X_k$ representing $k$ random samples drawn independently from the identified $k$ subgroups of $\Omega$, of sizes $n_1, \ldots, n_k \in \mathbb{N}$, respectively. In the following, the element $X_{ij}$ of the underlying $(n \times 2)$ data matrix $\boldsymbol{X}$ represents the $j$th value of $X$ in the random sample drawn from the $i$th subgroup of $\Omega$, with $\bar{X}_i$ the corresponding subgroup sample mean. The $k$ independent random samples can be understood to form a total random sample of size $n := n_1 + \ldots + n_k = \sum_{i=1}^{k} n_i$, with total sample mean $\bar{X}_n$; cf. Eq. (10.6).

The intention of the ANOVA procedure in the variant (ii) stated above is to empirically test the

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & \mu_1 = \ldots = \mu_k = \mu_0 \\
H_1: & \mu_i \neq \mu_0 \text{ at least for one } i = 1, \ldots, k
\end{array}
\tag{11.43}
$$

The necessary test prerequisites can be checked by (a) the Kolmogorov–Smirnov-test for normality of the $X$-distribution in each of the $k$ subgroups of $\Omega$ (cf. Sec. 11.3), and likewise by (b) Levene's test for $H_0: \sigma_1^2 = \ldots = \sigma_k^2 = \sigma_0^2$ against $H_1$: "$\sigma_i^2 \neq \sigma_0^2$ at least for one $i = 1, \ldots, k$" to test for equality of the variances in these $k$ subgroups (cf. Sec. 11.5).

R: leveneTest(variable ~ group variable) (package: car; called levene.test in older releases of the package)

The starting point of the ANOVA procedure is a simple algebraic decomposition of the random sample values $X_{ij}$ into three additive components according to

$$
X_{ij} = \bar{X}_n + (\bar{X}_i - \bar{X}_n) + (X_{ij} - \bar{X}_i) .
\tag{11.44}
$$

This expresses the $X_{ij}$ as the sum of the total sample mean, $\bar{X}_n$, the deviation of the subgroup sample means from the total sample mean, $(\bar{X}_i - \bar{X}_n)$, and the residual deviation of the sample values from their respective subgroup sample means, $(X_{ij} - \bar{X}_i)$. The decomposition of the $X_{ij}$ motivates a linear stochastic model for the population $\Omega$ of the form⁵

$$
\text{in } \Omega: \quad X_{ij} = \mu_0 + \alpha_i + \varepsilon_{ij}
\tag{11.45}
$$

in order to quantify, via the $\alpha_i$ ($i = 1, \ldots, k$), the potential influence of the qualitative variable $Y$ on the quantitative variable $X$. Here $\mu_0$ is the population mean of $X$, it holds that $\sum_{i=1}^{k} n_i \alpha_i = 0$, and it is assumed for the random errors $\varepsilon_{ij}$ that $\varepsilon_{ij} \overset{\text{i.i.d.}}{\sim} N(0; \sigma_0^2)$, i.e., that they are identically normally distributed and mutually stochastically independent.

⁴ Only experimental designs with fixed effects are considered here.

Having established the decomposition (11.44), one next turns to consider the associated set of sums of squared deviations defined by

$$
BSS := \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\bar{X}_i - \bar{X}_n)^2 = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X}_n)^2
\tag{11.46}
$$

$$
RSS := \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2
\tag{11.47}
$$

$$
TSS := \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_n)^2 ,
\tag{11.48}
$$

where the summations are (i) over all $n_i$ sample units within a subgroup, and (ii) over all of the $k$ subgroups themselves. The sums are referred to as, resp., (a) the sum of squared deviations between the subgroup samples (BSS), (b) the residual sum of squared deviations within the subgroup samples (RSS), and (c) the total sum of squared deviations (TSS) of the individual $X_{ij}$ from the total sample mean $\bar{X}_n$. It is a fairly elaborate though straightforward algebraic exercise to show that these three squared deviation terms relate to one another according to the strikingly elegant identity (cf. Bosch (1999) [6, p 220f])

$$
TSS = BSS + RSS .
\tag{11.49}
$$

Now, from the sums of squared deviations (11.46)–(11.48), one defines, resp., the total sample variance,

$$
S^2_{\text{total}} := \frac{1}{n-1} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_n)^2 = \frac{TSS}{n-1} ,
\tag{11.50}
$$

involving $df = n - 1$ degrees of freedom, the sample variance between subgroups,

$$
S^2_{\text{between}} := \frac{1}{k-1} \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X}_n)^2 = \frac{BSS}{k-1} ,
\tag{11.51}
$$

with $df = k - 1$, and the mean sample variance within subgroups,

$$
S^2_{\text{within}} := \frac{1}{n-k} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 = \frac{RSS}{n-k} ,
\tag{11.52}
$$

for which $df = n - k$.

⁵ Formulated in the context of this linear stochastic model, the null and research hypotheses are $H_0: \alpha_1 = \ldots = \alpha_k = 0$ and $H_1$: at least one $\alpha_i \neq 0$, respectively.

Employing the latter two subgroup-specific dispersion measures, the set of hypotheses (11.43) may be recast into the alternative form

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & S^2_{\text{between}} \leq S^2_{\text{within}} \\
H_1: & S^2_{\text{between}} > S^2_{\text{within}}
\end{array}
\tag{11.53}
$$

Finally, as a test statistic for the ANOVA procedure one chooses the ratio of variances⁶

$$
T_{n,k} := \frac{\text{(sample variance between subgroups)}}{\text{(mean sample variance within subgroups)}} = \frac{BSS/(k-1)}{RSS/(n-k)} ,
$$

expressing the size of the "sample variance between subgroups" in terms of multiples of the "mean sample variance within subgroups"; it thus constitutes a relative measure. A real effect of difference between subgroups is thus given when the non-negative numerator turns out to be significantly larger than the non-negative denominator. Mathematically, this statistical measure of deviations between the data and the null hypothesis is given by

Test statistic:⁷

$$
T_{n,k} := \frac{S^2_{\text{between}}}{S^2_{\text{within}}} \overset{H_0}{\sim} F(k - 1, n - k) .
\tag{11.54}
$$

Under $H_0$, it satisfies an F-distribution with $df_1 = k - 1$ and $df_2 = n - k$ degrees of freedom; cf. Sec. 8.8.



It is a well-established standard in practical applications of the one-way ANOVA procedure to present the results in the form of a

Summary table:

| ANOVA variability | sum of squares | df | mean square | test statistic |
|---|---|---|---|---|
| between groups | BSS | $k - 1$ | $S^2_{\text{between}}$ | $t_{n,k}$ |
| within groups | RSS | $n - k$ | $S^2_{\text{within}}$ | |
| total | TSS | $n - 1$ | | |



⁶ This ratio is sometimes given as $T_{n,k} := \frac{\text{(explained variance)}}{\text{(unexplained variance)}}$, in analogy to expression (12.10) below. Occasionally, one also considers the coefficient $\eta^2 := \frac{BSS}{TSS}$ which, however, does not account for the degrees of freedom involved. In this respect, the modified coefficient $\tilde{\eta}^2 := \frac{S^2_{\text{between}}}{S^2_{\text{total}}}$ would constitute a more sophisticated measure.

⁷ Note the one-to-one correspondence to the test statistic (11.28) employed in the independent samples F-test for a population variance.

Test decision: The rejection region for $H_0$ at significance level $\alpha$ is given by (right-sided test)

$$
t_{n,k} > f_{k-1,n-k;1-\alpha} .
\tag{11.55}
$$

With Eq. (10.13), the p-value associated with a specific realisation $t_{n,k}$ of (11.54) amounts to

$$
p = P(T_{n,k} > t_{n,k} | H_0) = 1 - P(T_{n,k} \leq t_{n,k} | H_0) = 1 - F\text{cdf}(0, t_{n,k}, k - 1, n - k) .
\tag{11.56}
$$

GDC: mode STAT → TESTS → ANOVA(
SPSS: Analyze → Compare Means → One-Way ANOVA...
R: anova(lm(variable ~ group variable)) (variances equal),
oneway.test(variable ~ group variable) (variances not equal)
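The sums of squared deviations (11.46)–(11.48), the identity (11.49) and the test statistic (11.54) can be traced numerically with a short Python sketch. The three groups of data and the function name `one_way_anova` are hypothetical; the sketch only reproduces the entries of the summary table and does not evaluate the F-distribution itself.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (BSS, RSS, TSS, t_nk) for a list of subgroup samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n                        # total sample mean
    bss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)     # Eq. (11.46)
    rss = sum((x - mean(g)) ** 2 for g in groups for x in g)            # Eq. (11.47)
    tss = sum((x - grand_mean) ** 2 for g in groups for x in g)         # Eq. (11.48)
    t_nk = (bss / (k - 1)) / (rss / (n - k))                            # Eq. (11.54)
    return bss, rss, tss, t_nk


# Hypothetical samples from k = 3 subgroups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]

bss, rss, tss, t_nk = one_way_anova(groups)
print(bss, rss, tss, t_nk)
```

For these values the subgroup means are 2, 3 and 7 and the total mean is 4, giving BSS = 42, RSS = 6 and TSS = 48, in line with TSS = BSS + RSS, and a test statistic of $(42/2)/(6/6) = 21$ with $df_1 = 2$ and $df_2 = 6$.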

When a one-way ANOVA yields a statistically significant result, so-called post-hoc tests need to be run subsequently in order to identify those subgroups $i$ whose means $\mu_i$ differ most drastically from the reference value $\mu_0$. The Student–Newman–Keuls-test (Newman (1939) [37] and Keuls (1952) [23]), e.g., successively subjects the pairs of subgroups with the largest differences in sample means to independent samples t-tests; cf. Sec. 11.5. Other useful post-hoc tests are those developed by Holm–Bonferroni (Holm (1979) [22]), Tukey (Tukey (1977) [59]), or by Scheffé (Scheffé (1959) [46]).

SPSS: Analyze → Compare Means → One-Way ANOVA... → Post Hoc...

R: pairwise.t.test(variable, group variable, p.adj = "bonferroni")



11.12 Kruskal-Wallis-test for a population median 

Finally, a feasible alternative to the one-way ANOVA when the conditions for its legitimate application cannot be met, or one is interested in the distributional properties of a specific ordinally scaled variable $X$, is given by the non-parametric significance test devised by the US-American mathematician and statistician William Henry Kruskal (1919–2005) and the US-American economist and statistician Wilson Allen Wallis (1912–1998) in 1952 [26]. The Kruskal–Wallis-test serves to detect significant differences for a population median of an ordinally or metrically scaled variable $X$ between $k \geq 3$ independent subgroups of some population $\Omega$. To be investigated is the pair of mutually exclusive

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & \tilde{x}_{0.5}(1) = \ldots = \tilde{x}_{0.5}(k) \\
H_1: & \text{at least one } \tilde{x}_{0.5}(i) \ (i = 1, \ldots, k) \text{ is different from the other group medians}
\end{array}
\tag{11.57}
$$

Introduce ranks according to $x_j(1) \mapsto R[x_j(1)], \ldots,$ and $x_j(k) \mapsto R[x_j(k)]$ within the random samples drawn independently from each of the $k \geq 3$ subgroups of $\Omega$ on the basis of an ordered joint random sample of size $n := n_1 + \ldots + n_k = \sum_{i=1}^{k} n_i$; cf. Sec. 4.3. Then form the sum of ranks for each random sample separately, i.e.,

$$
R_{+i} := \sum_{j=1}^{n_i} R[x_j(i)] \quad (i = 1, \ldots, k) .
\tag{11.58}
$$
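The ranking step of Eq. (11.58) can be sketched in Python as shown below. As an explicit assumption, the sketch then feeds the rank sums into the standard textbook form of the Kruskal–Wallis statistic without tie correction, $\frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_{+i}^2}{n_i} - 3(n+1)$; the data and the function name `kruskal_wallis` are hypothetical.

```python
def kruskal_wallis(groups):
    """Return (t_nk, df, rank_sums) from ranks in the ordered joint sample."""
    joint = sorted(x for g in groups for x in g)
    n = len(joint)
    # Average rank of each distinct value; tied values share the mean rank
    rank_of = {}
    for value in set(joint):
        first = joint.index(value) + 1          # smallest rank occupied by value
        count = joint.count(value)
        rank_of[value] = first + (count - 1) / 2
    rank_sums = [sum(rank_of[x] for x in g) for g in groups]   # R_{+i}, Eq. (11.58)
    # Assumed standard statistic, no correction for ties
    t_nk = 12 / (n * (n + 1)) * sum(
        r ** 2 / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return t_nk, len(groups) - 1, rank_sums


# Hypothetical samples from k = 3 subgroups with n_i = 5 each
groups = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
]

t_nk, df, rank_sums = kruskal_wallis(groups)
print(t_nk, df, rank_sums)
```

In this deliberately extreme example the rank sums are 15, 40 and 65, and the statistic evaluates to $0.05 \times 1210 - 48 = 12.5$ with $df = 2$.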



Provided the sample sizes satisfy the condition $n_i \geq 5$ for all $k \geq 3$ independent random samples (hence, $n \geq 15$), the test procedure can be based on the

Test statistic:

$$
T_{n,k} := \left[ \frac{12}{n(n+1)} \sum_{i=1}^{k} \frac{R_{+i}^2}{n_i} \right] - 3(n+1) \overset{H_0}{\approx} \chi^2(k - 1)
\tag{11.59}
$$

which, under $H_0$, approximately satisfies a $\chi^2$-distribution with $df = k - 1$ degrees of freedom (cf. Sec. 8.6); see, e.g., Rinne (2008) [45, p 553].

Test decision: The rejection region for $H_0$ at significance level $\alpha$ is given by (right-sided test)

$$
t_{n,k} > \chi^2_{k-1;1-\alpha} .
\tag{11.60}
$$

By Eq. (10.13), the p-value associated with a realisation $t_{n,k}$ of (11.59) amounts to

$$
p = P(T_{n,k} > t_{n,k} | H_0) = 1 - P(T_{n,k} \leq t_{n,k} | H_0) = 1 - \chi^2\text{cdf}(0, t_{n,k}, k - 1) .
\tag{11.61}
$$

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → K Independent Samples...: Kruskal-Wallis H

R: kruskal.test(variable ~ group variable)

Chapter 12

Bivariate methods of statistical data analysis: testing for association

Recognising patterns of regularity in the variability of data sets for given (observable) variables, and explaining them in terms of causal relationships in the context of a suitable theoretical model, is one of the main objectives of any empirical scientific discipline; see, e.g., Penrose (2004) [43]. Causal relationships are viewed as intimately related to interactions between objects or agents of the physical and/or of the social kind. A necessary condition on the way of theoretically fathoming causal relationships is to establish empirically the existence of significant statistical associations between the variables in question. The possibility of replication of observational or experimental results of this kind, when given, lends strong support in favour of this idea. Regrettably, however, the existence of causal relationships between two variables cannot be established with absolute certainty by empirical means; compelling theoretical arguments need to take over. Causal relationships between variables imply an unambiguous distinction between independent variables and dependent variables. In the following, we will discuss the principles of the simplest three inferential statistical methods providing empirical checks of the aforementioned necessary condition in the bivariate case.

12.1 Correlation analysis and simple linear regression 
12.1.1 t-test for a correlation 

The parametric correlation analysis presupposes metrically scaled variables $X$ and $Y$ which satisfy a bivariate normal distribution in $\Omega$. Its aim is to investigate whether or not $X$ and $Y$ feature a quantitative-statistical association of a linear nature, given random sample data $\boldsymbol{X} \in \mathbb{R}^{n \times 2}$. Formulated in terms of the population correlation coefficient $\rho$ according to Auguste Bravais (1811–1863) and Karl Pearson FRS (1857–1936), the method tests $H_0$ against $H_1$ in one of the alternative pairs of

Hypotheses: (test for association)

$$
\begin{array}{ll}
H_0: & \rho = 0 \quad\text{or}\quad \rho \geq 0 \quad\text{or}\quad \rho \leq 0 \\
H_1: & \rho \neq 0 \quad\text{or}\quad \rho < 0 \quad\text{or}\quad \rho > 0
\end{array}
\tag{12.1}
$$

with $-1 \leq \rho \leq +1$.

Normality of the marginal $X$- and $Y$-distributions in a given random sample $S_\Omega$: $(X_1, \ldots, X_n; Y_1, \ldots, Y_n)$ drawn from $\Omega$ can again be tested for (for $n \geq 50$) by the Kolmogorov–Smirnov-test; cf. Sec. 11.3. A scatter plot of the raw sample data $\{(x_i, y_i)\}_{i=1,\ldots,n}$ represents features of the joint $(X, Y)$-distribution.

SPSS: Analyze → Nonparametric Tests → Legacy Dialogs → 1-Sample K-S...: Normal
R: ks.test(variable, "pnorm")

Rescaling the sample correlation coefficient $r$ of Eq. (4.19) by the inverse of the standard error of $r$,

$$
\sqrt{\frac{1 - r^2}{n - 2}} ,
\tag{12.2}
$$

which can be obtained from the theoretical sampling distribution of $r$, presently yields the

Test statistic:

$$
T_n := \sqrt{n - 2} \, \frac{r}{\sqrt{1 - r^2}} \overset{H_0}{\sim} t(n - 2)
\tag{12.3}
$$

which, under $H_0$, satisfies a t-distribution with $df = n - 2$ degrees of freedom; cf. Sec. 8.7.

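Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at significance level $\alpha$ is given by

| Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ |
|---|---|---|---|
| (a) two-sided | $\rho = 0$ | $\rho \neq 0$ | $\lvert t_n \rvert > t_{n-2;1-\alpha/2}$ |
| (b) left-sided | $\rho \geq 0$ | $\rho < 0$ | $t_n < t_{n-2;\alpha} = -t_{n-2;1-\alpha}$ |
| (c) right-sided | $\rho \leq 0$ | $\rho > 0$ | $t_n > t_{n-2;1-\alpha}$ |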
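A brief Python sketch may help to see how the test statistic (12.3) follows from the sample correlation coefficient. The paired data and the function names `pearson_r` and `correlation_t` are hypothetical; the sketch computes $r$ directly from sums of products of deviations and then rescales it, leaving the comparison with t-quantiles to the table above.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Sample correlation coefficient from sums of deviation products."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def correlation_t(r, n):
    """Test statistic of Eq. (12.3) with df = n - 2."""
    return sqrt(n - 2) * r / sqrt(1 - r ** 2)


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x, y)
t_n = correlation_t(r, len(x))
print(r, t_n)
```

For these numbers $r = 0.8$, so that $t_n = \sqrt{3} \times 0.8 / 0.6 \approx 2.31$ with $df = 3$.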



p-values associated with realisations $t_n$ of (12.3) can be obtained from Eqs. (10.11)–(10.13).

SPSS: Analyze → Correlate → Bivariate...: Pearson
R: cor.test(variable1, variable2),
cor.test(variable1, variable2, alternative = "less"),
cor.test(variable1, variable2, alternative = "greater")

It is generally recommended to handle significant test results of a correlation analysis for metrically scaled variables $X$ and $Y$ with some care due to the possibility of spurious correlations induced by additional control variables $Z, \ldots$ acting in the background. To exclude this possibility, the correlation analysis should be repeated for homogeneous subgroups of the sample $S_\Omega$.

12.1.2 F-test of a regression model

For correlations between metrically scaled variables $X$ and $Y$ significant in $\Omega$ at level $\alpha$, where $\rho$ takes a magnitude in the interval

$$
0.6 \leq |\rho| \leq 1.0 ,
$$

it is meaningful to ask which linear mathematical model best represents the detected linear statistical association; cf. Pearson (1903) [42]. To this end, simple linear regression seeks to devise a linear stochastic regression model for the population $\Omega$ of the form

$$
\text{in } \Omega: \quad Y_i = \alpha + \beta x_i + \varepsilon_i \quad (i = 1, \ldots, n) ,
\tag{12.4}
$$

which, for instance, assigns $X$ the role of an independent variable (and so its values $x_i$ can be considered prescribed by the modeller) and $Y$ the role of a dependent variable, thus rendering the model essentially univariate in nature. The regression coefficients $\alpha$ and $\beta$ denote the unknown $y$-intercept and slope of the model in $\Omega$. For the random errors $\varepsilon_i$ it is assumed that

$$
\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0; \sigma^2) ,
\tag{12.5}
$$

meaning they are identically normally distributed (with mean zero and constant variance $\sigma^2$) and mutually stochastically independent. With respect to the random sample $S_\Omega$: $(X_1, \ldots, X_n; Y_1, \ldots, Y_n)$, the supposed linear relationship between $X$ and $Y$ is expressed by

$$
\text{in } S_\Omega: \quad y_i = a + b x_i + e_i \quad (i = 1, \ldots, n) .
\tag{12.6}
$$

Residuals are thereby defined according to

$$
e_i := y_i - \hat{y}_i = y_i - a - b x_i \quad (i = 1, \ldots, n) ,
\tag{12.7}
$$

which, for given value of $x_i$, encode the difference between the observed realisations $y_i$ and the corresponding (by the linear regression model) predicted realisations $\hat{y}_i$ of $Y$. By construction the residuals satisfy the condition $\sum_{i=1}^{n} e_i = 0$.

Next, introduce sums of squared deviations for the $Y$-data in line with the ANOVA procedure of Sec. 11.11, i.e.,

$$
TSS := \sum_{i=1}^{n} (y_i - \bar{y})^2
\tag{12.8}
$$

$$
RSS := \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 .
\tag{12.9}
$$

In terms of these quantities, the coefficient of determination of Eq. (5.8) for assessing the goodness-of-the-fit of a regression model can be expressed by

$$
B = \frac{TSS - RSS}{TSS} = \frac{\text{(total variance of } Y) - \text{(unexplained variance of } Y)}{\text{(total variance of } Y)} ,
\tag{12.10}
$$

where the latter equality holds for simple linear regression (with just a single independent variable) only. The normalised measure $B$, of range $0 \leq B \leq 1$, expresses the proportion of variability in a data set of $Y$ which can be explained by the corresponding variability of $X$ through the best-fit regression model.

In the methodology of a regression analysis, the first issue to be addressed is to test the significance of the overall regression model (12.4), i.e., to test $H_0$ against $H_1$ in the set of

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & \beta = 0 \\
H_1: & \beta \neq 0
\end{array}
\tag{12.11}
$$

Exploiting the goodness-of-the-fit aspect of the regression model quantified by the coefficient of determination (12.10), one derives the (see, e.g., Hatzinger and Nagel (2009) [20, Eq. (7.8)])

Test statistic:¹

$$
T_n := (n - 2) \frac{B}{1 - B} \overset{H_0}{\sim} F(1, n - 2)
\tag{12.12}
$$

which, under $H_0$, satisfies an F-distribution with $df_1 = 1$ and $df_2 = n - 2$ degrees of freedom; cf. Sec. 8.8.

Test decision: The rejection region for $H_0$ at significance level $\alpha$ is given by (right-sided test)

$$
t_n > f_{1,n-2;1-\alpha} .
\tag{12.13}
$$

With Eq. (10.13), the p-value associated with a specific realisation $t_n$ of (12.12) amounts to

$$
p = P(T_n > t_n | H_0) = 1 - P(T_n \leq t_n | H_0) = 1 - F\text{cdf}(0, t_n, 1, n - 2) .
\tag{12.14}
$$
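The route from the sums of squared deviations to the overall F-statistic can be illustrated with a small Python sketch. It assumes the usual least-squares fit of a straight line and the form $(n-2)B/(1-B)$ of the test statistic, which for simple linear regression equals the square of the correlation t-statistic; the data and the function names `fit_line` and `regression_f` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares intercept a and slope b of y = a + b x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def regression_f(x, y):
    """Return (B, t_n): coefficient of determination and F-statistic."""
    a, b = fit_line(x, y)
    y_bar = mean(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)                      # Eq. (12.8)
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))     # Eq. (12.9)
    coeff_det = (tss - rss) / tss                                 # Eq. (12.10)
    n = len(x)
    return coeff_det, (n - 2) * coeff_det / (1 - coeff_det)


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

coeff_det, t_n = regression_f(x, y)
print(coeff_det, t_n)
```

With TSS = 10 and RSS = 3.6 one finds $B = 0.64$ and $t_n = 3 \times 0.64 / 0.36 = 16/3 \approx 5.33$ with $df_1 = 1$ and $df_2 = 3$.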



12.1.3 t-test for the regression coefficients 

The second issue to be addressed in a systematic regression analysis is to test statistically which of the regression coefficients in Eq. (12.4) is non-zero. In the case of simple linear regression, though, the matter for the coefficient $\beta$ is settled already by the F-test of the regression model just outlined, resp. the t-test for $\rho$ described in Subsec. 12.1.1; see, e.g., Levin et al (2010) [30, p 389f]. However, when extending the regression analysis of data to the more complex case of multiple linear regression, an approach frequently employed in the research literature of the Social Sciences and Economics, this question attains relevance in its own right. In view of this prospect, we continue with our methodological considerations.

First of all, unbiased point estimators for the regression coefficients $\alpha$ and $\beta$ in Eq. (12.4) are obtained from application to the data of Gauß' method of minimising the sum of squared residuals (RSS) (cf. Gauß (1809) and Ch. 5),

$$
\text{minimise} \quad RSS = \sum_{i=1}^{n} e_i^2 ,
$$

yielding

$$
b = \frac{S_Y}{s_X} \, r \quad\text{and}\quad a = \bar{Y} - b \bar{x} .
\tag{12.15}
$$

The equation of the best-fit linear regression model is thus given by

$$
\hat{y} = \bar{Y} + \frac{S_Y}{s_X} \, r \, (x - \bar{x}) ,
\tag{12.16}
$$

and can be employed for purposes of generating predictions for $Y$, given an independent value of $X$ in the empirical interval $[x_{(1)}, x_{(n)}]$.

¹ Note that with the identity $B = r^2$ of Eq. (5.9), which applies in simple linear regression, this is just the square of the test statistic (12.3).



Next, the standard errors associated with the values of the point estimators $a$ and $b$ in Eq. (12.15) are derived from the corresponding theoretical sampling distributions and amount to (cf., e.g., Hartung et al (2005) [21, p 576ff])

$$
SE_a := \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{(n - 1) s_X^2}} \; SE_e
\tag{12.17}
$$

$$
SE_b := \frac{SE_e}{\sqrt{n - 1} \, s_X} ,
\tag{12.18}
$$

where the standard error of the residuals $e_i$ is defined by

$$
SE_e := \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}} .
\tag{12.19}
$$



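We now describe the test procedure for the regression coefficient $\beta$. To be tested is $H_0$ against $H_1$ in one of the alternative pairs of

Hypotheses: (test for differences)

$$
\begin{array}{ll}
H_0: & \beta = 0 \quad\text{or}\quad \beta \geq 0 \quad\text{or}\quad \beta \leq 0 \\
H_1: & \beta \neq 0 \quad\text{or}\quad \beta < 0 \quad\text{or}\quad \beta > 0
\end{array}
\tag{12.20}
$$

Dividing the sample regression slope $b$ by its standard error yields the

Test statistic:

$$
T_n := \frac{b}{SE_b} \overset{H_0}{\sim} t(n - 2)
\tag{12.21}
$$

which, under $H_0$, satisfies a t-distribution with $df = n - 2$ degrees of freedom; cf. Sec. 8.7.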

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at significance level $\alpha$ is given by

| Kind of test | $H_0$ | $H_1$ | Rejection region for $H_0$ |
|---|---|---|---|
| (a) two-sided | $\beta = 0$ | $\beta \neq 0$ | $\lvert t_n \rvert > t_{n-2;1-\alpha/2}$ |
| (b) left-sided | $\beta \geq 0$ | $\beta < 0$ | $t_n < t_{n-2;\alpha} = -t_{n-2;1-\alpha}$ |
| (c) right-sided | $\beta \leq 0$ | $\beta > 0$ | $t_n > t_{n-2;1-\alpha}$ |



p-values associated with realisations $t_n$ of (12.21) can be obtained from Eqs. (10.11)–(10.13). We emphasise once more that for simple linear regression the test procedure just described is equivalent to the correlation analysis of Subsec. 12.1.1.

An analogous t-test needs to be run to check whether the regression coefficient $\alpha$ is non-zero, too, using the ratio $\frac{a}{SE_a}$ as the test statistic. However, in particular when the origin of $X$ is not contained in the empirical interval $[x_{(1)}, x_{(n)}]$, the null hypothesis $H_0: \alpha = 0$ is a meaningless statement.

GDC: mode STAT → TESTS → LinRegTTest...

SPSS: Analyze → Regression → Linear...

R: summary(lm(variable1 ~ variable2))

Note: Regrettably, SPSS provides no option to select between a one-sided and a two-sided t-test. 
The default setting is for a two-sided test. For the purpose of one-sided tests the p-value output of 
SPSS needs to be divided by 2. 
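The chain from the point estimators to the t-statistics for the regression coefficients can be followed in a short Python sketch. It is a sketch under stated assumptions: the residual standard error is taken with $n - 2$ degrees of freedom, the standard errors are $SE_b = SE_e/(\sqrt{n-1}\,s_X)$ and $SE_a = \sqrt{1/n + \bar{x}^2/((n-1)s_X^2)}\,SE_e$, and the data as well as the function name `regression_coefficient_tests` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev  # stdev uses (n - 1) in the denominator


def regression_coefficient_tests(x, y):
    """Estimates, standard errors and t-statistics for a and b (df = n - 2)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
    b = s_y / s_x * r                       # slope estimate
    a = y_bar - b * x_bar                   # intercept estimate
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_e = sqrt(rss / (n - 2))              # assumed residual standard error
    se_b = se_e / (sqrt(n - 1) * s_x)
    se_a = sqrt(1 / n + x_bar ** 2 / ((n - 1) * s_x ** 2)) * se_e
    return {"a": a, "b": b, "se_a": se_a, "se_b": se_b,
            "t_a": a / se_a, "t_b": b / se_b, "df": n - 2}


# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

result = regression_coefficient_tests(x, y)
print(result)
```

For these data $b = 0.8$ and $a = 0.6$, with $SE_b = \sqrt{0.12}$ and $SE_a = \sqrt{1.32}$. The slope statistic $b/SE_b \approx 2.31$ coincides with the correlation t-statistic for the same data, which illustrates the equivalence noted above for simple linear regression.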

Lastly, by means of an analysis of the residuals one can assess the extent to which the prerequisites of a regression analysis stated in Eq. (12.5) are satisfied:

(i) for $n \geq 50$, normality of the distribution of residuals $e_i$ ($i = 1, \ldots, n$) can be checked by means of a Kolmogorov–Smirnov-test; cf. Sec. 11.3;

(ii) homoscedasticity of the $e_i$ ($i = 1, \ldots, n$), i.e., whether or not they have constant variance, can be investigated qualitatively in terms of a scatter plot that marks the standardised $e_i$ (along the vertical axis) against the corresponding predicted $Y$-values $\hat{y}_i$ ($i = 1, \ldots, n$) (along the horizontal axis). A circularly shaped envelope of the cloud of points indicates that homoscedasticity applies.

In reality, many quantitative phenomena studied in the Natural Sciences and in the Social Sciences prove to be of an inherently non-linear nature; see e.g. Gleick (1987) [19], Penrose (2004) [43], and Smith (2007) [49]. On the one hand, this increases the level of complexity involved in the data analysis; on the other, non-linear processes offer the reward of a plethora of interesting and intriguing (dynamical) phenomena.




12.2 Rank correlation analysis 
When the variables X and Y are metrically scaled but not normally distributed in the population
$\Omega$, or X and Y are ordinally scaled in the first place, the standard tool for testing for a
statistical association between X and Y is the non-parametric rank correlation analysis developed by the
English psychologist and statistician Charles Edward Spearman FRS (1863-1945) in 1904.
This approach, like the univariate test procedures of Mann and Whitney, Wilcoxon, and Kruskal
and Wallis discussed in Ch. 11, is again fundamentally rooted in the concept of ranks representing
statistical data which have a natural order, introduced earlier in these notes.



Following the translation of the original data pairs into corresponding rank data pairs,

\[ (x_i, y_i) \mapsto [R(x_i), R(y_i)] \quad (i = 1, \ldots, n) \,, \tag{12.22} \]



the objective is to subject $H_0$ in the alternative sets of

Hypotheses: (test for association)

\[
\begin{array}{l}
H_0: \rho_S = 0 \ \text{ or } \ \rho_S \geq 0 \ \text{ or } \ \rho_S \leq 0 \\
H_1: \rho_S \neq 0 \ \text{ or } \ \rho_S < 0 \ \text{ or } \ \rho_S > 0
\end{array}
\tag{12.23}
\]

with $\rho_S$ ($-1 \leq \rho_S \leq +1$) the population rank correlation coefficient, to a test of statistical
significance at level $\alpha$. Provided the size of the random sample is such that n > 30 (see, e.g., Bortz
(2005), p. 233), there exists a suitable



Test statistic:

\[ T_n := \sqrt{n-2}\,\frac{r_S}{\sqrt{1-r_S^2}} \ \overset{H_0}{\approx} \ t(n-2) \tag{12.24} \]

which, under $H_0$, approximately satisfies a t-distribution with df = n - 2 degrees of freedom; cf.
Sec. 8.7. Here, $r_S$ denotes the sample rank correlation coefficient defined in Eq. (4.31).

Test decision: Depending on the kind of test to be performed, the rejection region for $H_0$ at
significance level $\alpha$ is given by

Kind of test      $H_0$              $H_1$               Rejection region for $H_0$
(a) two-sided     $\rho_S = 0$       $\rho_S \neq 0$     $|t_n| > t_{n-2;1-\alpha/2}$
(b) left-sided    $\rho_S \geq 0$    $\rho_S < 0$        $t_n < t_{n-2;\alpha} = -t_{n-2;1-\alpha}$
(c) right-sided   $\rho_S \leq 0$    $\rho_S > 0$        $t_n > t_{n-2;1-\alpha}$
p-values associated with realisations $t_n$ of (12.24) can be obtained from Eqs. (10.11)-(10.13).

SPSS: Analyze → Correlate → Bivariate...: Spearman

R: cor.test(variable1, variable2, method="spearman"),
cor.test(variable1, variable2, method="spearman", alternative="less"),
cor.test(variable1, variable2, method="spearman", alternative="greater")
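The computation behind the test statistic (12.24) can be sketched in a few lines of Python. This is an illustration only: the function names and the data are hypothetical, and it is assumed that $r_S$ is obtained as the product-moment correlation of the rank numbers, with tied values sharing the mean of their rank positions. The sample is far smaller than the sample size required above, so the resulting value serves to show the arithmetic, not to support a test decision.

```python
import math

def average_ranks(values):
    """Rank numbers 1..n; tied values receive the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks

def correlation(u, v):
    """Product-moment correlation coefficient of two equally long lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def spearman_test_statistic(x, y):
    """Return (r_S, t_n) with t_n = sqrt(n - 2) * r_S / sqrt(1 - r_S^2)."""
    r_s = correlation(average_ranks(x), average_ranks(y))
    n = len(x)
    # Undefined for r_S = +1 or -1 (division by zero)
    t_n = math.sqrt(n - 2) * r_s / math.sqrt(1 - r_s ** 2)
    return r_s, t_n

# Hypothetical data, for illustration only
x = [10, 20, 30, 40, 50, 60]
y = [1, 3, 2, 5, 4, 6]
r_s, t_n = spearman_test_statistic(x, y)
print(r_s, t_n)
```

The realisation `t_n` would then be compared with the quantiles listed in the table of rejection regions above.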



12.3 $\chi^2$-test for independence



The non-parametric $\chi^2$-test for independence constitutes the most generally applicable
significance test for bivariate statistical associations. Due to its formal indifference to the scale levels of
the variables X and Y involved in the investigation, it may be used for the statistical analysis of any
pairwise combination of nominally, ordinally and metrically scaled variables. This
generality is paid for with a typically weaker test power.

Given qualitative and/or quantitative variables X and Y that take values in a spectrum of k mutually
exclusive categories $a_1, \ldots, a_k$ and l categories $b_1, \ldots, b_l$, respectively, the intention is to
subject $H_0$ in the pair of alternative

Hypotheses: (test for association)

$H_0$: There does not exist a statistical association between X and Y in $\Omega$
$H_1$: There does exist a statistical association between X and Y in $\Omega$          (12.25)

to a convenient empirical significance test at level $\alpha$.

A conceptual issue that requires special attention along the way is the definition of a reasonable
"zero point" on the scale of statistical dependence of variables X and Y (which one aims to
establish). This problem is solved by recognising that a common feature of sample data for variables
of all scale levels is the information residing in the distribution of (relative) frequencies over (all
possible combinations of) categories and drawing an analogy to the concept of stochastic
independence of two events as defined in probability theory by Eq. (6.13). In this way, by definition we
refer to variables X and Y as being mutually statistically independent provided that the relative
frequencies $h_{ij}$ of all bivariate combinations of categories $(a_i, b_j)$ are numerically equal to the
products of the univariate marginal relative frequencies $h_{i+}$ of $a_i$ and $h_{+j}$ of $b_j$ (cf. Sec. 4.1), i.e.,

\[ h_{ij} = h_{i+} h_{+j} \,. \tag{12.26} \]

Translated into the language of random sample variables, viz. introducing sample observed
frequencies, this operational independence condition is re-expressed by $O_{ij} = E_{ij}$, where the $O_{ij}$
denote the observed frequencies of all bivariate category combinations $(a_i, b_j)$ in a cross
tabulation underlying a specific random sample of size n, and the quantities $E_{ij}$, which are defined in
terms of (i) the univariate sum $O_{i+}$ of observed frequencies in row i, see Eq. (4.3), (ii) the
univariate sum $O_{+j}$ of observed frequencies in column j, see Eq. (4.4), and (iii) the sample size n
by $E_{ij} := \frac{O_{i+} O_{+j}}{n}$, are interpreted as the expected frequencies of $(a_i, b_j)$ given that X and Y
are statistically independent. Expressing deviations between observed and (under independence)
expected frequencies via the residuals $O_{ij} - E_{ij}$, the hypotheses may be reformulated as

Hypotheses: (test for association)

\[
\begin{array}{l}
H_0: O_{ij} - E_{ij} = 0 \\
H_1: O_{ij} - E_{ij} \neq 0
\end{array}
\tag{12.27}
\]

For the subsequent test procedure to be reliable, it is very important (!) that the empirical
prerequisite

\[ E_{ij} \geq 5 \tag{12.28} \]

holds for all values of i = 1, ..., k and j = 1, ..., l such that one avoids the possibility that
individual rescaled squared residuals $\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ become artificially magnified. The latter constitute
the core of the

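Test statistic:

\[ T_n := \sum_{i=1}^{k} \sum_{j=1}^{l} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ \overset{H_0}{\approx} \ \chi^2[(k-1) \times (l-1)] \tag{12.29} \]

which, under $H_0$, approximately satisfies a $\chi^2$-distribution with df = (k - 1) × (l - 1) degrees of
freedom; cf. Sec. 8.6.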

Test decision: The rejection region for $H_0$ at significance level $\alpha$ is given by (right-sided test)

\[ t_n > \chi^2_{(k-1) \times (l-1);1-\alpha} \,. \tag{12.30} \]

By Eq. (10.13), the p-value associated with a realisation $t_n$ of (12.29) amounts to

\[ p = P(T_n > t_n | H_0) = 1 - P(T_n \leq t_n | H_0) = 1 - \chi^2\mathrm{cdf}(0, t_n, (k-1) \times (l-1)) \,. \tag{12.31} \]

GDC: mode STAT → TESTS → $\chi^2$-Test...

SPSS: Analyze → Descriptive Statistics → Crosstabs... → Statistics...: Chi-square

R: chisq.test(row variable, column variable)
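The steps from the cross tabulation to the test statistic (12.29) can be sketched as follows. The Python code is a minimal illustration under stated assumptions: the (2 × 2) table of observed frequencies and the function name `chi_square_independence` are hypothetical, and the closed-form p-value used here holds for df = 1 only.

```python
import math

def chi_square_independence(observed):
    """Test statistic, degrees of freedom and expected frequencies
    for a k x l cross tabulation given as a list of rows."""
    k, l = len(observed), len(observed[0])
    n = sum(sum(row) for row in observed)
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(observed[i][j] for i in range(k)) for j in range(l)]
    # Expected frequencies under independence: E_ij = O_i+ * O_+j / n
    expected = [[row_sums[i] * col_sums[j] / n for j in range(l)] for i in range(k)]
    t_n = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
              for i in range(k) for j in range(l))
    df = (k - 1) * (l - 1)
    return t_n, df, expected

# Hypothetical (2 x 2) cross tabulation, for illustration only
observed = [[30, 20],
            [10, 40]]
t_n, df, expected = chi_square_independence(observed)

# Empirical prerequisite (12.28): every expected frequency is at least 5
prerequisite_ok = all(e >= 5 for row in expected for e in row)

# For df = 1 only, P(T_n > t_n | H_0) = erfc(sqrt(t_n / 2))
p_value = math.erfc(math.sqrt(t_n / 2)) if df == 1 else None

print(t_n, df, prerequisite_ok, p_value)
```

Note that R's `chisq.test` applies a continuity correction to (2 × 2) tables by default, so its output for the same table would differ somewhat from the uncorrected value computed here.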

The $\chi^2$-test for independence can establish the existence of a significant association between
variables X and Y. The strength of the association, on the other hand, may be measured in terms of
Cramér's V (Cramér (1946)), which has a normalised range of values given by $0 \leq V \leq 1$; cf.
Eq. (4.35) and Sec. 4.4. Low values of V in the case of significant associations between variables
X and Y typically indicate the statistical influence of additional control variables.

SPSS: Analyze → Descriptive Statistics → Crosstabs... → Statistics...: Phi and Cramér's V
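As a brief illustration of how V relates to the $\chi^2$ test statistic, the following Python sketch assumes the commonly used definition $V = \sqrt{\chi^2 / [n \min(k-1, l-1)]}$; whether this matches the form of Eq. (4.35) term by term is an assumption. The function name `cramers_v` and the input values are hypothetical.

```python
import math

def cramers_v(chi_square, n, k, l):
    """Cramér's V for a k x l cross tabulation of n observations."""
    return math.sqrt(chi_square / (n * min(k - 1, l - 1)))

# Hypothetical values: chi-square statistic 50/3 from a (2 x 2) table with n = 100
v = cramers_v(50 / 3, 100, 2, 2)
print(round(v, 3))
```

A value of V near 0.4, as in this example, would indicate an association of moderate strength on the normalised scale from 0 to 1.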
Appendix A



Principal component analysis of a (2 × 2)
correlation matrix



Consider a real-valued (2 × 2) correlation matrix expressed by

\[ R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} , \qquad -1 \leq r \leq +1 \,, \tag{A.1} \]

which, by construction, is symmetric. Its trace amounts to Tr(R) = 2, while its determinant
is det(R) = 1 - r². Consequently, R is regular as long as r ≠ ±1. We seek to determine
the eigenvalues (or principal components) and corresponding eigenvectors (or directions of the
principal axes) of R, i.e., real numbers $\lambda$ and real-valued vectors v such that the condition

\[ R v = \lambda v \quad \Leftrightarrow \quad (R - \lambda \mathbf{1}) v = 0 \tag{A.2} \]

applies. Solution of this algebraic problem leads to the characteristic equation

\[ 0 = \det(R - \lambda \mathbf{1}) = (1 - \lambda)^2 - r^2 = (\lambda - 1)^2 - r^2 \,. \tag{A.3} \]

Hence, it is clear that R possesses the two eigenvalues

\[ \lambda_1 = 1 + r \quad \text{and} \quad \lambda_2 = 1 - r \,, \tag{A.4} \]

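showing that R is positive-definite whenever |r| < 1. The normalised eigenvectors associated
with $\lambda_1$ and $\lambda_2$, obtained from Eq. (A.2), are then

\[ v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad \text{and} \quad v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 1 \end{pmatrix} , \tag{A.5} \]

and constitute a right-handedly oriented basis of the two-dimensional eigenspace of R. Note that
due to the symmetry of R it holds that $v_1^T \cdot v_2 = 0$.

The normalised eigenvectors of R define a regular orthogonal transformation matrix M, with
inverse $M^{-1} = M^T$, given by

\[ M = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} \quad \text{and} \quad M^{-1} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} = M^T \,, \tag{A.6} \]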
where Tr(M) = $\sqrt{2}$ and det(M) = 1. The correlation matrix R can now be diagonalised by
means of a rotation with M according to

\[ R_{\rm diag} = M^{-1} R M = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1+r & 0 \\ 0 & 1-r \end{pmatrix} . \tag{A.7} \]

Note that Tr($R_{\rm diag}$) = 2 and det($R_{\rm diag}$) = 1 - r², i.e., the trace and determinant of R remain
invariant under the diagonalising transformation.

The concepts of eigenvalues (principal components) and eigenvectors (principal axes), as well
as of diagonalisation of matrices, generalise in a straightforward though computationally more
demanding fashion to arbitrary real-valued correlation matrices $R \in \mathbb{R}^{m \times m}$, with $m \in \mathbb{N}$.



Alternatively, one can write

\[ M = \begin{pmatrix} \cos(\pi/4) & -\sin(\pi/4) \\ \sin(\pi/4) & \cos(\pi/4) \end{pmatrix} , \]

thus emphasising the character of a rotation of R by an angle $\varphi = \pi/4$.
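The diagonalisation of Eq. (A.7) is easily checked numerically. The following Python sketch does so for one hypothetical value of r; the function names are illustrative and not part of any library.

```python
import math

def matmul(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def diagonalise_correlation_matrix(r):
    """Return M^T R M for R = [[1, r], [r, 1]] and the rotation matrix M of Eq. (A.6)."""
    s = 1.0 / math.sqrt(2.0)
    R = [[1.0, r], [r, 1.0]]
    M = [[s, -s], [s, s]]      # columns are the normalised eigenvectors v_1, v_2
    M_T = [[s, s], [-s, s]]    # transpose of M, equal to its inverse
    return matmul(matmul(M_T, R), M)

r = 0.6  # hypothetical correlation value
R_diag = diagonalise_correlation_matrix(r)
trace = R_diag[0][0] + R_diag[1][1]
determinant = R_diag[0][0] * R_diag[1][1] - R_diag[0][1] * R_diag[1][0]
print(R_diag, trace, determinant)
```

Up to rounding errors, the diagonal entries equal 1 + r and 1 - r, the off-diagonal entries vanish, and the trace and determinant agree with those of R.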



Appendix B 

Distance measures in Statistics 



Statistics employs a number of different measures of distance $d_{ij}$ to quantify the separation in
an m-D space of metrically scaled statistical variables X, Y, ..., Z of two statistical units i and
j (i, j = 1, ..., n). Note that, by construction, these measures exhibit the properties $d_{ij} \geq 0$,
$d_{ij} = d_{ji}$ and $d_{ii} = 0$. In the following, $X_{ik}$ is the entry of the data matrix $X \in \mathbb{R}^{n \times m}$ relating to
the ith statistical unit and the kth statistical variable, etc. The $d_{ij}$ define the elements of a (n × n)
proximity matrix $D \in \mathbb{R}^{n \times n}$.



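Euclidian distance (dimensionful)

This most straightforward, dimensionful distance measure is named after the ancient Greek (?)
mathematician Euclid of Alexandria (ca. 325BC-ca. 265BC). It is defined by

\[ d^{E}_{ij} := \sqrt{ \sum_{k=1}^{m} \sum_{l=1}^{m} (X_{ik} - X_{jk}) \, \delta_{kl} \, (X_{il} - X_{jl}) } \,, \tag{B.1} \]

where $\delta_{kl}$ denotes the elements of the unit matrix $\mathbf{1} \in \mathbb{R}^{m \times m}$; cf. Ref. [?, Eq. (2.2)].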



Mahalanobis distance (dimensionless)

A more sophisticated, scale-invariant distance measure in Statistics was devised by the Indian
applied statistician Prasanta Chandra Mahalanobis (1893-1972); cf. Mahalanobis (1936). It
is defined by

\[ d^{M}_{ij} := \sqrt{ \sum_{k=1}^{m} \sum_{l=1}^{m} (X_{ik} - X_{jk}) \, (S^{-1})_{kl} \, (X_{il} - X_{jl}) } \,, \tag{B.2} \]

where $(S^{-1})_{kl}$ denotes the elements of the inverse covariance matrix $S^{-1}$ relating to
X, Y, ..., Z; cf. Subsec. [?].
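Both distance measures share the same quadratic form and differ only in the weight matrix, which the following Python sketch makes explicit for m = 2 variables. It is an illustration under stated assumptions: the data matrix and all function names are hypothetical, and the sample covariance matrix is assumed to be normalised by n - 1.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix (normalised by n - 1, an assumption) of an n x m data matrix."""
    n, m = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(m)]
    return [[sum((row[k] - means[k]) * (row[l] - means[l]) for row in data) / (n - 1)
             for l in range(m)] for k in range(m)]

def inverse_2x2(s):
    """Inverse of a regular 2 x 2 matrix."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det],
            [-s[1][0] / det, s[0][0] / det]]

def distance(x_i, x_j, weight):
    """Square root of the quadratic form sum_k sum_l (x_ik - x_jk) w_kl (x_il - x_jl)."""
    m = len(x_i)
    return math.sqrt(sum((x_i[k] - x_j[k]) * weight[k][l] * (x_i[l] - x_j[l])
                         for k in range(m) for l in range(m)))

# Hypothetical data matrix: n = 4 statistical units, m = 2 variables
data = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
unit_matrix = [[1.0, 0.0], [0.0, 1.0]]
s_inv = inverse_2x2(covariance_matrix(data))

d_euclid = distance(data[0], data[2], unit_matrix)   # weight: unit matrix
d_mahalanobis = distance(data[0], data[2], s_inv)    # weight: inverse covariance matrix
print(d_euclid, d_mahalanobis)
```

Rescaling one of the variables, e.g. multiplying the first column of `data` by 10, changes `d_euclid` but leaves `d_mahalanobis` unchanged, which is what scale invariance means here.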
Appendix C

Glossary of technical terms (GB - D)



A 

ANOVA: Varianzanalyse 
arithmetical mean: arithmetischer Mittelwert 
association: Zusammenhang, Assoziation 
attribute: Ausprägung, Eigenschaft

B 

bar chart: Balkendiagramm 
Bayes' theorem: Satz von Bayes 
best-fit model: Anpassungsmodell 
binomial coefficient: Binomialkoeffizient 
bivariate: bivariat, zwei variable Größen betreffend

C 

category: Kategorie 

causal relationship: Kausalbeziehung 

census: statistische Vollerhebung 

central limit theorem: Zentraler Grenzwertsatz 

certain event: sicheres Ereignis 

class interval: Ausprägungsklasse

cluster random sample: Klumpenzufallsstichprobe 

coefficient of determination: Bestimmtheitsmaß

coefficient of variation: Variationskoeffizient 

combination: Kombination 

combinatorics: Kombinatorik 

compact: geschlossen, kompakt 

concentration: Konzentration 

conditional probability: bedingte Wahrscheinlichkeit 
confidence interval: Konfidenzintervall 
contingency table: Kontingenztafel 
continuous data: stetige Daten 
convexity: Konvexität
correlation matrix: Korrelationsmatrix
covariance matrix: Kovarianzmatrix 

cumulative distribution function (cdf): theoretische Verteilungsfunktion
D 

data matrix: Datenmatrix 
deductive method: deduktive Methode 
degrees of freedom: Freiheitsgrade 
dependent variable: abhängige Variable
descriptive statistics: Beschreibende Statistik 
deviation: Abweichung 
direction: Richtung 
discrete data: diskrete Daten 

disjoint events: disjunkte Ereignisse, einander ausschließend
dispersion: Streuung 
distance: Abstand 
distortion: Verzerrung 
distribution: Verteilung 

distributional properties: Verteilungseigenschaften 
E 

elementary event: Elementarereignis 

empirical cumulative distribution function: empirische Verteilungsfunktion 
estimator: Schätzer

Euclidian distance: Euklidischer Abstand 
event: Ereignis 
event space: Ereignisraum 
expectation value: Erwartungswert 

F 

factorial: Fakultät
falsification: Falsifikation 

five number summary: Fünfpunktzusammenfassung
frequency: Häufigkeit

G 

Gini coefficient: Ginikoeffizient 
goodness-of-the-fit: Anpassungsgüte

H 

Hessian matrix: Hesse'sche Matrix 
histogram: Histogramm 

homoscedasticity: Homoskedastizität, homogene Varianz
hypothesis: Hypothese, Behauptung, Vermutung 

I 

independent variable: unabhängige Variable
inductive method: induktive Methode 
inferential statistics: Schließende Statistik
interaction: Wechselwirkung 
intercept: Achsenabschnitt 
interquartile range: Quartilsabstand 
interval scale: Intervallskala 
impossible event: unmögliches Ereignis

J 

joint distribution: gemeinsame Verteilung 
K 

kσ-rule: kσ-Regel
kurtosis: Wölbung

L 

latent variable: latente Variable, Konstrukt 

law of large numbers: Gesetz der großen Zahlen

law of total probability: Satz von der totalen Wahrscheinlichkeit 

linear regression analysis: lineare Regressionsanalyse 

Lorenz curve: Lorenzkurve 

M 

Mahalanobis distance: Mahalanobis'scher Abstand 
manifest variable: manifeste Variable, Observable 
marginal frequencies: Randhäufigkeiten
measurement: Messung, Datenaufnahme 
median: Median 
metrical: metrisch 
mode: Modalwert

N 

nominal: nominal 
O 

observable: beobachtbare/messbare Variable, Observable 
observation: Beobachtung 

operationalisation: Operationalisieren, latente Variable messbar gestalten 
opinion poll: Meinungsumfrage 
ordinal: ordinal 
outlier: Ausreißer

P 

p-value: p-Wert 
partition: Zerlegung, Aufteilung 
pie chart: Kreisdiagramm 
point estimator: Punktschätzer
population: Grundgesamtheit 
power: Teststärke
power set: Potenzmenge

principal component analysis: Hauptkomponentenanalyse
probability: Wahrscheinlichkeit 

probability density function (pdf): Wahrscheinlichkeitsdichte
probability function: Wahrscheinlichkeitsfunktion 
proximity matrix: Distanzmatrix 

Q 

quantile: Quantil 
quartile: Quartil 

R 

random sample: Zufallsstichprobe 
random experiment: Zufallsexperiment 
random variable: Zufallsvariable 
range: Spannweite 
rank: Rang 

ratio scale: Verhältnisskala
raw data set: Datenurliste 

realisation: Realisation, konkreter Messwert für eine Zufallsvariable

regression analysis: Regressionsanalyse 

regression coefficient: Regressionskoeffizient 

rejection region: Ablehnungsbereich 

research question: Forschungsfrage 

residual: Residuum, Restgröße

risk: Risiko (berechenbar) 

S 

σ-algebra: σ-Algebra
sample: Stichprobe 

sample correlation coefficient: Stichprobenkorrelationskoeffizient 

sample covariance: Stichprobenkovarianz 

sample mean: Stichprobenmittelwert 

sample space: Ergebnismenge 

sample variance: Stichprobenvarianz 

sampling distribution: Stichprobenverteilung 

sampling error: Stichprobenfehler 

sampling frame: Auswahlgesamtheit 

sampling unit: Stichprobeneinheit 

scale-invariant: skaleninvariant 

scale level: Skalenniveau 

scatter plot: Streudiagramm 

shift theorem: Verschiebungssatz 

significance: Signifikanz 

significance level: Signifikanzniveau 

simple random sample: einfache Zufallsstichprobe 
skewness: Schiefe
slope: Steigung 

spectrum of values: Wertespektrum 

spurious correlation: Scheinkorrelation 

standard error: Standardfehler 

standardisation: Standardisierung 

statistical independence: statistische Unabhängigkeit

statistical unit: Erhebungseinheit 

statistical variable: Merkmal, Variable 

stochastic independence: stochastische Unabhängigkeit

stratified random sample: geschichtete Zufallsstichprobe 

strength: Stärke

survey: statistische Erhebung, Umfrage 
T 

test statistic: Teststatistik, statistische Effektmessgröße
type I error: Fehler 1. Art
type II error: Fehler 2. Art 

U 

unbiased: erwartungstreu 
uncertainty: Unsicherheit (nicht berechenbar) 
univariate: univariat, eine variable Größe betreffend
urn model: Urnenmodell

V 

value: Wert 
variance: Varianz 
variation: Variation 

W 

weighted mean: gewichteter Mittelwert 
Z 

z-scores: z-Werte 